# Stanford Law Faculty Publications: Prototype System

### Sudip Das
### Last Updated: 11 Dec, 2025

In [44]:
import pandas as pd
import numpy as np
import re
import os
from openai import OpenAI
from google.colab import userdata
import json
import requests

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [45]:
file_path = "/content/drive/My Drive/Stanford/Faculty Publications-Grid view.csv"

df = pd.read_csv(file_path)
df.head()

Unnamed: 0,ID,Faculty Contributor,Faculty Co-Authors/Editors,Co-Authors,Title,Publication Type,Contribution,Date,Year,Status,...,ISSN,Abstract,Source Link(s),Stanford Link,DOI,PURL,SSRN,In ORCID,Corrected,Designation
0,500784.0,"Ablavsky, Gregory",,,Speculation Nation: Land Mania in the Revoluti...,Book Review,Writer,2024-12,2024.0,Published,...,,,,https://law.stanford.edu/publications/speculat...,,,,,checked,Faculty
1,513224.0,"Ablavsky, Gregory",,Felix S. Cohen,Cohen's Handbook of Federal Indian Law,"Book, Whole",Editor,2024-10,2024.0,Published,...,,,https://store.lexisnexis.com/en-us/products/co...,https://law.stanford.edu/publications/cohens-h...,,,,,checked,Faculty
2,493786.0,"Ablavsky, Gregory",,,The Original Meaning of Commerce in the Indian...,Journal Article,Writer,2024,2024.0,Published,...,,"In Haaland v. Brackeen, the Supreme Court retu...",,https://law.stanford.edu/publications/the-orig...,,,,,checked,Faculty
3,437598.0,"Ablavsky, Gregory",,,Clarence Thomas Went After My Work. His Critic...,Op-Ed or Opinion Piece,Writer,2023-06-20,2023.0,Published,...,,If judges are going to use history as their gu...,https://slate.com/news-and-politics/2023/06/cl...,https://law.stanford.edu/publications/clarence...,,,,,checked,Faculty
4,407547.0,"Ablavsky, Gregory",,,Akhil Amar's Unusable Past,Book Review,Writer,2023,2023.0,Published,...,,,,https://law.stanford.edu/publications/akhil-am...,,,,,checked,Faculty


In [37]:
print("Shape (rows, columns):", df.shape)
print(df.columns.tolist())

Shape (rows, columns): (1927, 31)
['ID', 'Faculty Contributor', 'Faculty Co-Authors/Editors', 'Co-Authors', 'Title', 'Publication Type', 'Contribution', 'Date', 'Year', 'Status', 'Serial Title', 'Publisher', 'Publication Title', 'Editor(s)', 'Citation', 'Italics', 'Volume', 'Issue', 'Pages', 'Edition', 'ISBN', 'ISSN', 'Abstract', 'Source Link(s)', 'Stanford Link', 'DOI', 'PURL', 'SSRN', 'In ORCID', 'Corrected', 'Designation']


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1927 entries, 0 to 1926
Data columns (total 31 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          1926 non-null   float64
 1   Faculty Contributor         1927 non-null   object 
 2   Faculty Co-Authors/Editors  96 non-null     object 
 3   Co-Authors                  560 non-null    object 
 4   Title                       1927 non-null   object 
 5   Publication Type            1927 non-null   object 
 6   Contribution                1699 non-null   object 
 7   Date                        1927 non-null   object 
 8   Year                        1926 non-null   float64
 9   Status                      1316 non-null   object 
 10  Serial Title                1440 non-null   object 
 11  Publisher                   1680 non-null   object 
 12  Publication Title           251 non-null    object 
 13  Editor(s)                   127 n

In [6]:
display(df.describe(include=['object']).T)

Unnamed: 0,count,unique,top,freq
Faculty Contributor,1927,87,"Mello, Michelle M.",116
Faculty Co-Authors/Editors,96,28,"""Engstrom, David Freeman""",15
Co-Authors,560,409,Nicholson Price; Rachel Sachs; Jacob S. Sherkow,15
Title,1927,1741,Stanford Law Faculty on the Historic Confirmat...,7
Publication Type,1927,13,Journal Article,895
Contribution,1699,3,Writer,1688
Date,1927,911,2023,84
Status,1316,2,Published,1279
Serial Title,1440,488,Written Description Blog,45
Publisher,1680,328,Stanford Law School,114


### Initial data profile – key observations

**Size & structure**: 1,927 rows and 31 columns. Most columns are object (string) dtype; only a few are numeric.

**Quality**: A few columns, such as Title and Faculty Contributor, are complete, while others, such as ISSN, DOI, and Edition, are missing most of their values. The next step is to check whether these gaps can be filled from other sources or whether the data is also missing at the original source.

**ID**: This is the key used to query or retrieve a publication; however, one row is missing its ID.

In [21]:
df['ID'].isna().sum()

np.int64(1)

In [22]:
df['ID'].nunique()

1786

## Standardizing Names

Faculty Contributor uses the format Last Name, First Name.

Faculty Co-Authors/Editors wraps each name in quotation marks to separate multiple names.

Co-Authors are separated by semicolons and written as First Name Last Name.

Names should follow one consistent format across all of these columns.

The same person may be referred to in different ways across entries (middle initials, accent marks, etc.), which requires normalization or a separate identifier column such as a Batch ID, if Stanford has something similar.
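
One way to detect the same person under slightly different spellings is to compare names on a normalized key rather than on the raw string. The sketch below is only an illustration: `name_match_key` is a hypothetical helper that is not part of this notebook, and it assumes names are already in `Last, First Middle` form. It strips accent marks, lower-cases the surname, and keeps only the first given name.

```python
import re
import unicodedata


def name_match_key(name: str) -> str:
    """Build a comparison key from a 'Last, First Middle' name (hypothetical helper)."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Split on the first comma: surname on the left, given names on the right
    last, _, given = unaccented.partition(",")

    # Keep only the first given name, since middle names and initials vary between entries
    given_tokens = re.findall(r"[A-Za-z]+", given)
    first = given_tokens[0].lower() if given_tokens else ""

    # Surname key: lowercase letters only (spaces and punctuation removed)
    last_key = re.sub(r"[^a-z]", "", last.lower())
    return f"{last_key}|{first}"


key_with_initial = name_match_key("Mello, Michelle M.")
key_without_initial = name_match_key("Mello, Michelle")
key_accented = name_match_key("Bâli, Asli")
```

Both spellings of "Mello, Michelle" map to the same key, so they can be grouped for manual review. Letters that have no Unicode decomposition, such as the dotless ı in "Aslı", are not folded by this approach and would need an explicit mapping.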

In [46]:
#Standardizing faculty Co-Authors/Editors column
df['Faculty Co-Authors/Editors'] = (
    df['Faculty Co-Authors/Editors']
      .str.replace(r'"\s*,\s*"', '; ', regex=True)
      .str.replace('"', '', regex=False)
      .str.strip()
)
df[df['Faculty Co-Authors/Editors'].notna()]['Faculty Co-Authors/Editors'].head(10)

Unnamed: 0,Faculty Co-Authors/Editors
10,"Reese, Elizabeth Hidalgo"
35,"Ouellette, Lisa Larrimore"
65,"Ford, Richard Thompson"
78,"Sykes, Alan O."
81,"Fried, Barbara"
87,"McConnell, Michael W.; Gould IV, William B.; S..."
279,"Engstrom, Nora Freeman"
280,"Engstrom, Nora Freeman"
284,"Engstrom, Nora Freeman"
285,"Engstrom, Nora Freeman"


In [47]:
# Standardizing Co-Authors and Editors column
def transform_name_first_last(name: str) -> str:
    """
    Transform 'First Middle Last' -> 'Last, First Middle'
    """
    if not isinstance(name, str):
        return name

    name = name.strip()
    if not name:
        return name

    parts = name.split()
    if len(parts) == 1:
        return name

    last = parts[-1]
    first_middle = ' '.join(parts[:-1])
    return f"{last}, {first_middle}"


def normalize_coauthors(value: str, separators_pattern: str) -> str:
    """
    Split an author string on the given separators pattern,
    normalize each name as 'Last, First', and join with '; '.
    """
    if pd.isna(value) or not str(value).strip():
        return value

    text = str(value).strip()

    # Split on the provided separators (e.g. ';' or '[;,]')
    authors = re.split(separators_pattern, text)
    authors = [a.strip() for a in authors if a.strip()]

    transformed = [transform_name_first_last(a) for a in authors]
    return '; '.join(transformed)

# Co-Authors: authors separated by ';'
df["Co-Authors"] = df["Co-Authors"].apply(
    lambda s: normalize_coauthors(s, r';')
)

# Editor(s): authors separated by ';' OR ','
df["Editor(s)"] = df["Editor(s)"].apply(
    lambda s: normalize_coauthors(s, r'[;,]')
)


print(df["Co-Authors"].dropna().head(5))

print(df["Editor(s)"].dropna().head(5))

1                   Cohen, Felix S.
7                Allread, W. Tanner
30    Deer, Sarah; Richland, Justin
42               Sirleaf, Matiangai
43                   McDougall, Gay
Name: Co-Authors, dtype: object
1                 Berger, Bethany R.; Blackhawk, Maggie
30    Stern, Simon; Mar, Maksymilian Del; Meyler, Be...
37                                     Chetail, Vincent
38                                     Chetail, Vincent
46    Binder, Christina; Nowak, Manfred; Hofbauer, J...
Name: Editor(s), dtype: object


In [48]:
# Validation checks
print(df["Co-Authors"].dropna().head(10))

print(df["Editor(s)"].dropna().head(10))

1                   Cohen, Felix S.
7                Allread, W. Tanner
30    Deer, Sarah; Richland, Justin
42               Sirleaf, Matiangai
43                   McDougall, Gay
44               Gathii, James Thuo
46                      Lake, Diane
47                 Devakumar, Delan
51                     Last, Tamara
53                       Bâli, Aslı
Name: Co-Authors, dtype: object
1                  Berger, Bethany R.; Blackhawk, Maggie
30     Stern, Simon; Mar, Maksymilian Del; Meyler, Be...
37                                      Chetail, Vincent
38                                      Chetail, Vincent
46     Binder, Christina; Nowak, Manfred; Hofbauer, J...
56     Costello, Cathryn; Foster, Michelle; McAdam, Jane
84                  Grossman, Joanna L.; Kim, Suzanne A.
93           Fontenay, Elisabeth D. de; Broughman, Brian
94                               Cumming, D.; Hammer, B.
100          Fontenay, Elisabeth D. de; Broughman, Brian
Name: Editor(s), dtype: object
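
The output above shows that treating the final token as the surname mishandles multi-word surnames: "Maksymilian Del Mar" becomes "Mar, Maksymilian Del" and "Elisabeth D. de Fontenay" becomes "Fontenay, Elisabeth D. de". A possible refinement is sketched below. The function name `to_last_first` and the particle and suffix lists are assumptions made for illustration and are not exhaustive; the sketch expects input in `First Middle Last` order without commas.

```python
# Assumed, non-exhaustive lists; extend them after reviewing the actual data
PARTICLES = {"de", "del", "della", "van", "von", "der", "da", "di", "la", "le"}
SUFFIXES = {"jr", "jr.", "sr", "sr.", "ii", "iii", "iv"}


def to_last_first(name: str) -> str:
    """Transform 'First Middle Last' into 'Last, First Middle', keeping particles with the surname."""
    parts = name.split()
    if len(parts) < 2:
        return name

    # Peel off a trailing generational suffix such as 'IV' or 'Jr.'
    suffix = ""
    if len(parts) > 2 and parts[-1].lower() in SUFFIXES:
        suffix = parts.pop()

    # The surname starts at the last token and extends left over any particles,
    # while always leaving at least one token as the given name
    start = len(parts) - 1
    while start > 1 and parts[start - 1].lower() in PARTICLES:
        start -= 1

    last = " ".join(parts[start:])
    given = " ".join(parts[:start])
    if suffix:
        last = f"{last} {suffix}"
    return f"{last}, {given}"


examples = [
    to_last_first("Maksymilian Del Mar"),
    to_last_first("Elisabeth D. de Fontenay"),
    to_last_first("William B. Gould IV"),
    to_last_first("Felix S. Cohen"),
]
```

A lookup list like this is easy to audit but will still miss unusual names, so a manual review of the transformed values remains advisable.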


## Normalizing Text/String fields

In [49]:
import html
import unicodedata
def clean_text_basic(text: str) -> str:
    """
    Basic text normalization:
    - Handle NaN safely
    - Strip leading/trailing whitespace
    - Remove HTML tags
    - Unescape HTML entities
    - Normalize unicode (NFKC)
    - Normalize quotes/dashes
    - Collapse multiple spaces
    """
    if pd.isna(text):
        return text

    s = str(text)

    s = s.strip()  # Strip outer whitespace
    s = re.sub(r"<[^>]+>", "", s)  # Remove HTML tags

    s = html.unescape(s)
    s = unicodedata.normalize("NFKC", s)

    # Normalize curly quotes and dashes to simpler forms
    replacements = {
        "“": '"',
        "”": '"',
        "‘": "'",
        "’": "'",
        "–": "-",   # en dash
        "—": "-",   # em dash
        "\u00a0": " ",  # non-breaking space
    }
    for old, new in replacements.items():
        s = s.replace(old, new)
    s = re.sub(r"\s+", " ", s).strip()  # Collapse whitespace runs; trim anything left after tag removal

    return s

In [50]:
text_cols = [
    "Title",
    "Serial Title",
    "Publisher",
    "Publication Title",
    "Citation",
]

for col in text_cols:
    df[col] = df[col].apply(clean_text_basic)

# Quick check of the cleaned columns on the first rows
df[["Title", "Publisher", "Citation"]].head(30)

Unnamed: 0,Title,Publisher,Citation
0,Speculation Nation: Land Mania in the Revoluti...,Oxford University Press,"Gregory Ablavsky, Speculation Nation: Land Man..."
1,Cohen's Handbook of Federal Indian Law,LexisNexis,"Fᴇʟɪx S. Cᴏʜᴇɴ, Cᴏʜᴇɴ'ꜱ Hᴀɴᴅʙᴏᴏᴋ ᴏꜰ Fᴇᴅᴇʀᴀʟ Iɴ..."
2,The Original Meaning of Commerce in the Indian...,University of Connecticut School of Law,"Gregory Ablavsky, The Original Meaning of Comm..."
3,Clarence Thomas Went After My Work. His Critic...,Graham Holdings,"Gregory Ablavsky, Clarence Thomas Went After M..."
4,Akhil Amar's Unusable Past,University of Michigan Law School,"Gregory Ablavsky, Akhil Amar's Unusable Past, ..."
5,"Book Review, Creek Internationalism in an Age ...","The University of North Carolina Press,Univers...","Gregory Ablavsky, Creek Internationalism in an..."
6,Too Much History: Castro-Huerta and the Proble...,University of Chicago Press,"Gregory Ablavsky, Too Much History: Castro-Hue..."
7,We the (Native) People?: How Indigenous People...,Columbia Law School,"Gregory Ablavsky & W. Tanner Allread, We the (..."
8,Getting Public Rights Wrong: The Lost History ...,Stanford Law School,"Gregory Ablavsky, Getting Public Rights Wrong:..."
9,Oklahoma's Bizarro Nineteenth Century in Castr...,Stanford Law School,"Gregory Ablavsky, Oklahoma's Bizarro Nineteent..."


In [51]:
#Remove Quotations from Publisher
df["Publisher"] = df["Publisher"].apply(
    lambda x: x.replace('"', '').replace("'", '') if isinstance(x, str) else x
)

## Extracting Year from Date

In [52]:
# Take first 4 characters and convert to a year
df["Year"] = pd.to_numeric(
    df["Date"].str.strip().str.slice(0, 4),
    errors="coerce"
).astype("Int64")

#Validation Check
df.loc[df["ID"] == 448166, ["ID", "Faculty Contributor", "Title", "Year", "Date"]]

Unnamed: 0,ID,Faculty Contributor,Title,Year,Date
1630,448166.0,"Sivas, Deborah A.",Should We Bring Species Back from Extinction?,2023,2023-06-12
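
Slicing the first four characters works when every Date value starts with the year. If some values carried a leading word or stray text, a pattern-based extraction would be more tolerant. The sketch below runs on invented sample values, not on the real column, and assumes that a plausible year starts with 19 or 20.

```python
import pandas as pd

# Invented sample values covering the formats seen in Date plus two awkward cases
dates = pd.Series(["2023-06-12", "2024-10", "2023", " 2021 ", "Spring 2020", None])

# Capture the first standalone four-digit number starting with 19 or 20
year_text = dates.str.extract(r"\b((?:19|20)\d{2})\b", expand=False)

# Convert to a nullable integer so rows without a year stay missing instead of failing
year = pd.to_numeric(year_text, errors="coerce").astype("Int64")
```

Rows with no recognizable year end up as `<NA>`, which makes them easy to list and inspect separately.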


## ID Deduplication

Further analysis shows that, of the 1,926 non-null IDs, 1,786 are unique, so 140 rows repeat an ID that already appears elsewhere.
It is crucial to check whether these rows are exact repeats, such as the same publication copied from multiple sources, or whether they contain errors.

The plan is to delete exact duplicates, where every column matches across rows, and to investigate rows that share an ID but differ in other columns.
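
The different duplicate counts are easy to confuse, so the toy example below shows how they relate. The IDs are invented for illustration. The number of surplus rows (non-null rows minus unique IDs) always equals the number of rows in repeated groups minus the number of repeated IDs.

```python
import pandas as pd

# Invented IDs: 101 appears twice, 103 three times, and one value is missing
ids = pd.Series([101, 101, 102, 103, 103, 103, 104, None])

non_null = int(ids.notna().sum())        # rows that have an ID
unique = int(ids.nunique())              # distinct IDs (missing values are ignored)
surplus_rows = non_null - unique         # rows that repeat an already-seen ID

counts = ids.value_counts()              # rows per ID, missing values excluded
ids_with_repeats = int((counts > 1).sum())                # IDs appearing more than once
rows_in_repeated_groups = int(counts[counts > 1].sum())   # all rows belonging to those IDs
```

Here `surplus_rows` is 3, `ids_with_repeats` is 2, and `rows_in_repeated_groups` is 5, so 5 - 2 = 3.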

In [53]:
id_counts = df['ID'].value_counts()
dup_ids = id_counts[id_counts > 1].index

print("Number of IDs with more than one row:", len(dup_ids))

dup_df = df[df['ID'].isin(dup_ids)].copy()
print("Number of rows with duplicate IDs:", dup_df.shape[0])

Number of IDs with more than one row: 126
Number of rows with duplicate IDs: 266


In [54]:
# Are any duplicate-ID rows completely identical across all columns?
dup_df['is_exact_duplicate'] = dup_df.duplicated(keep=False)

print("Rows that are exact duplicates (same ID, same all fields):")
display(dup_df[dup_df['is_exact_duplicate']].head(10))

Rows that are exact duplicates (same ID, same all fields):


Unnamed: 0,ID,Faculty Contributor,Faculty Co-Authors/Editors,Co-Authors,Title,Publication Type,Contribution,Date,Year,Status,...,Abstract,Source Link(s),Stanford Link,DOI,PURL,SSRN,In ORCID,Corrected,Designation,is_exact_duplicate
127,407567.0,"Brest, Paul",,,Processes of Constitutional Decisionmaking: Ca...,Textbook/Casebook,,2021,2021,,...,,,https://law.stanford.edu/publications/processe...,,,,,,Emeritus,True
128,407567.0,"Brest, Paul",,,Processes of Constitutional Decisionmaking: Ca...,Textbook/Casebook,,2021,2021,,...,,,https://law.stanford.edu/publications/processe...,,,,,,Emeritus,True
185,407587.0,"Daines, Robert M.",,,Recent Developments in Executive Compensation ...,"Book, Section",Writer,2023,2023,,...,,,https://law.stanford.edu/publications/recent-d...,,,,,,Emeritus,True
187,407587.0,"Daines, Robert M.",,,Recent Developments in Executive Compensation ...,"Book, Section",Writer,2023,2023,,...,,,https://law.stanford.edu/publications/recent-d...,,,,,,Emeritus,True
450,499949.0,"Friedman, Lawrence M.",,,Freedom of Expression and the Age of the Silve...,Journal Article,Writer,2024-10,2024,Published,...,,https://heinonline-org.ezproxy.law.stanford.ed...,https://law.stanford.edu/publications/freedom-...,,,,,,Emeritus,True
451,499949.0,"Friedman, Lawrence M.",,,Freedom of Expression and the Age of the Silve...,Journal Article,Writer,2024-10,2024,Published,...,,https://heinonline-org.ezproxy.law.stanford.ed...,https://law.stanford.edu/publications/freedom-...,,,,,,Emeritus,True
518,407002.0,"Goldstein, Paul",,,Setting Boundaries,Blog Postings,Writer,2021,2021,,...,,,https://law.stanford.edu/publications/setting-...,,,,,,Faculty,True
519,407002.0,"Goldstein, Paul",,,Setting Boundaries,Blog Postings,Writer,2021,2021,,...,,,https://law.stanford.edu/publications/setting-...,,,,,,Faculty,True
564,500143.0,"Greely, Henry T.",,,Reference Guide on Neuroscience,"Book, Section",Writer,2025,2025,,...,,,https://law.stanford.edu/publications/referenc...,,,,,,Faculty,True
565,500143.0,"Greely, Henry T.",,,Reference Guide on Neuroscience,"Book, Section",Writer,2025,2025,,...,,,https://law.stanford.edu/publications/referenc...,,,,,,Faculty,True


In [55]:
dup_df['is_exact_duplicate'].value_counts()

Unnamed: 0_level_0,count
is_exact_duplicate,Unnamed: 1_level_1
False,191
True,75


In [56]:
non_exact = dup_df[~dup_df['is_exact_duplicate']].copy()

print(non_exact.shape[0])

display(
    non_exact
    .sort_values(["ID", "Faculty Contributor", "Title"])
    .head(10)
)

191


Unnamed: 0,ID,Faculty Contributor,Faculty Co-Authors/Editors,Co-Authors,Title,Publication Type,Contribution,Date,Year,Status,...,Abstract,Source Link(s),Stanford Link,DOI,PURL,SSRN,In ORCID,Corrected,Designation,is_exact_duplicate
35,229380.0,"Ablavsky, Gregory","Ouellette, Lisa Larrimore",,Selling Patents to Indian Tribes to Delay the ...,Journal Article,Writer,2018-01-02,2018,Published,...,,https://jamanetwork.com/journals/jamainternalm...,https://law.stanford.edu/publications/selling-...,,,,,checked,Faculty,False
1484,229380.0,"Ouellette, Lisa Larrimore","Ablavsky, Gregory",,Selling Patents to Indian Tribes to Delay the ...,Journal Article,Writer,2018-01-02,2018,Published,...,,https://jamanetwork.com/journals/jamainternalm...,https://law.stanford.edu/publications/selling-...,,,,checked,checked,Faculty,False
508,229440.0,"Gilson, Ronald J.","Brest, Paul","Wolfson, Mark",How Investors Can (and Can't) Create Social Value,Brief,,2017-12-06,2017,,...,,,https://law.stanford.edu/publications/brief-fo...,,,,,,Emeritus,False
936,229440.0,"Klausner, Michael",,,Brief for Corporate Law Professors As Amici Cu...,Brief,,2017-12-06,2017,Published,...,,,https://law.stanford.edu/publications/brief-fo...,,,,,,Faculty,False
1047,229445.0,"Lemley, Mark A.","Malone, Philip R.","Pearlman, Jef",Brief of Amici Curiae Law Professors and Publi...,Brief,Writer,2018-01-23,2018,Published,...,,,https://law.stanford.edu/publications/brief-of...,,,,,checked,Faculty,False
1079,229445.0,"Malone, Philip R.",,,Brief of Amici Curiae Law Professors and Publi...,Brief,,2018-01-23,2018,,...,,,https://law.stanford.edu/publications/brief-of...,,,,,,Clinical Faculty,False
220,232417.0,"Donohue III, John J.",,"Morantz, Alison D.",Brief of Amici Curiae Economists and Professor...,Brief,,2018-01-18,2018,Published,...,,,https://law.stanford.edu/publications/brief-of...,,,,,checked,Faculty,False
1333,232417.0,"Morantz, Alison D.",,,Brief of Amici Curiae Economists and Professor...,Brief,,2018-01-18,2018,,...,,,https://law.stanford.edu/publications/brief-of...,,,,,,Faculty,False
1262,232841.0,"Mello, Michelle M.","Sonne, James A.","Opel, Douglas J.",Vaccination without Litigation - Addressing Re...,Journal Article,Writer,2018-03-01,2018,Published,...,,https://www.nejm.org/doi/full/10.1056/NEJMp171...,https://law.stanford.edu/publications/vaccinat...,,,,,checked,Faculty,False
1683,232841.0,"Sonne, James A.",,,Vaccination without Litigation - Addressing Re...,Journal Article,Writer,2018-03-01,2018,Published,...,,https://www.nejm.org/doi/full/10.1056/NEJMp171...,https://law.stanford.edu/publications/vaccinat...,,,,,,Clinical Faculty,False


In [57]:
# Drop exact row duplicates across ALL columns
df_clean = df.drop_duplicates(keep="first").copy()

print("Before:", df.shape[0], "rows")
print("After :", df_clean.shape[0], "rows")

Before: 1927 rows
After : 1889 rows


The exact duplicate rows are now removed.
The remaining rows that share an ID differ in small ways, such as Faculty Contributor and Faculty Co-Author being swapped or otherwise mismatched, or titles containing special characters.

In [58]:
# Recompute duplicate IDs on the cleaned frame
id_counts = df_clean["ID"].value_counts()
dup_ids = id_counts[id_counts > 1].index

dup_df = df_clean[df_clean["ID"].isin(dup_ids)].copy()

print("Number of rows with duplicate IDs (after exact dedupe):", dup_df.shape[0])
print("Number of duplicate IDs:", len(dup_ids))

Number of rows with duplicate IDs (after exact dedupe): 193
Number of duplicate IDs: 91


## Rule-based Deduplication using AI

For exact duplicates, keep only one row.

If exactly one row in an ID group has Corrected = checked, keep that row.

Otherwise, pass the group to the LLM to choose the row to keep; the rest are flagged for removal.

The LLM bases its decision on completeness and consistency.
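
The order of these rules can be sketched without any API call. In the toy example below, the LLM step is replaced by a simple stand-in that keeps the row with the most non-null fields. That stand-in, the function names, and the data are assumptions for illustration only and do not reproduce the model's judgment.

```python
import pandas as pd


def pick_row_to_keep(group: pd.DataFrame, fallback) -> int:
    """Return the index of the row to keep for one ID group."""
    checked = group.index[group["Corrected"].eq("checked")]
    if len(checked) == 1:
        # Exactly one reviewed row: keep it without consulting the fallback
        return int(checked[0])
    # Zero or several reviewed rows: defer to the fallback chooser
    return int(fallback(group))


def most_complete_row(group: pd.DataFrame) -> int:
    """Stand-in for the LLM: pick the row with the most non-null fields."""
    return group.notna().sum(axis=1).idxmax()


# Invented records: ID 1 has one checked row, ID 2 has none
toy = pd.DataFrame(
    {
        "ID": [1, 1, 2, 2],
        "Title": ["A", "A", "B", "B"],
        "DOI": [None, "10.1/a", "10.1/b", None],
        "Corrected": ["checked", None, None, None],
    },
    index=[10, 11, 20, 21],
)

keep = {
    int(id_val): pick_row_to_keep(group, most_complete_row)
    for id_val, group in toy.groupby("ID")
}
```

For ID 1 the checked row (index 10) wins even though row 11 is more complete; for ID 2 the stand-in selects the more complete row (index 20).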

In [59]:
api_key = userdata.get('OPENAI_API_KEY')
if api_key is None:
    raise ValueError("No OPENAI_API_KEY secret found. Add it in Colab → Secrets sidebar.")

client = OpenAI(api_key=api_key)

cols_for_llm = [
    "ID",
    "Faculty Contributor",
    "Faculty Co-Authors/Editors",
    "Co-Authors",
    "Title",
    "Publication Type",
    "Contribution",
    "Date",
    "Status",
    "Serial Title",
    "Publisher",
    "Publication Title",
    "Editor(s)",
    "Citation",
    "ISBN",
    "ISSN",
    "Source Link(s)",
    "Stanford Link",
]

def choose_row_with_llm(group: pd.DataFrame) -> int:
    """
    Given a group of rows with the same ID (from df_clean) that has no
    unique 'Corrected == "checked"' row, ask the model which row index
    to KEEP (canonical). Returns the DataFrame index (int) of the chosen row.
    """

    rows = []
    for idx, row in group[cols_for_llm].iterrows():
        data = row.to_dict()
        data["row_index"] = int(idx)
        rows.append(data)

    prompt = f"""
You are helping deduplicate faculty publication records from Stanford Law.

All the rows below share the same internal ID and are candidate records for the same publication.
Each row has a unique 'row_index'.

Your task:
Select exactly ONE row to treat as the canonical record to KEEP.
Downstream, we will mark that row with "no" in a 'remove' column and mark the others with "yes" as candidates to drop.

Decision rules (apply in this order):
1. Prefer rows that are more complete:
   - Fewer null/empty values across fields is better.
   - Pay special attention to fields like Title, Date/Year, Publication Type, Serial Title, Status, Stanford Link, and Citation.
2. Prefer rows that are more internally consistent:
   - Date/Year, Status, and Citation should agree (e.g., no obviously impossible years).
   - Title and Serial Title should form a plausible legal publication (journal/book/etc.).
   - Stanford Link / URLs should look plausible and not clearly malformed.
3. Prefer rows with stronger identifiers or links:
   - Presence of a valid Stanford Link or other stable URL is a positive signal.
4. If there is still a tie:
   - Prefer the row with richer text (e.g., longer non-empty Title or Citation).
   - If still tied, choose the row with the lowest 'row_index'.

Output format:
Return ONLY valid JSON with the single row_index you choose to KEEP:

{{
  "canonical_row_index": <integer>
}}

Do not include any explanation or additional fields.

Here are the rows (JSON list):

{json.dumps(rows, ensure_ascii=False, indent=2)}
"""

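    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}]
    )

    content = response.choices[0].message.content
    result = json.loads(content)
    keep_idx = int(result["canonical_row_index"])

    # Guard against the model returning an index that is not part of this ID group
    if keep_idx not in group.index:
        raise ValueError(
            f"Model returned row_index {keep_idx}, which is not in this group: {list(group.index)}"
        )
    return keep_idx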




In [60]:

df_clean["remove"] = "no"
llm_count = 0
skip_count = 0   # groups resolved without LLM

# Group by ID and decide which row to keep per group
for id_val, group in dup_df.groupby("ID"):
    corrected_mask = group["Corrected"].eq("checked")
    num_corrected = corrected_mask.sum()

    if num_corrected == 1:
        # Exactly one corrected row → keep that one, drop the others (no LLM call)
        keep_idx = group[corrected_mask].index[0]
        skip_count += 1
    else:
        # Either 0 or >1 corrected rows → call LLM to choose the single best row to KEEP
        keep_idx = choose_row_with_llm(group)
        llm_count += 1

    # First mark all rows in this ID group as "yes" (to be removed)
    df_clean.loc[group.index, "remove"] = "yes"

    # Then mark the chosen row as "no" (KEEP)
    df_clean.loc[keep_idx, "remove"] = "no"

print(f"LLM was used for {llm_count} ID groups.")
print(f"Skipped LLM for {skip_count} ID groups (exactly one corrected row).")

LLM was used for 61 ID groups.
Skipped LLM for 30 ID groups (exactly one corrected row).


In [61]:
# Sanity check
checked = df_clean[
    (df_clean["Corrected"] == "checked") &
    (df_clean["remove"] == "yes")
]

print("Total rows with Corrected='checked' AND remove='yes':", len(checked))

checked["ID"].head()

Total rows with Corrected='checked' AND remove='yes': 37


Unnamed: 0,ID
114,494404.0
280,501086.0
296,384899.0
304,308137.0
325,485895.0


In [62]:
ids_to_check = [494404, 501086, 384899, 485895, 485899]

df_clean[df_clean["ID"].isin(ids_to_check)]

Unnamed: 0,ID,Faculty Contributor,Faculty Co-Authors/Editors,Co-Authors,Title,Publication Type,Contribution,Date,Year,Status,...,Abstract,Source Link(s),Stanford Link,DOI,PURL,SSRN,In ORCID,Corrected,Designation,remove
113,494404.0,"Belt, Rabia",,,"Rabia Belt, Disability, Dignity, and Democracy...","Book, Section",Writer,2025,2025,Forthcoming,...,,,https://law.stanford.edu/publications/rabia-be...,,,,,checked,Faculty,no
114,494404.0,"Belt, Rabia",,,"Disability, Dignity, and Democracy","Book, Section",Writer,2025,2025,Forthcoming,...,,,https://law.stanford.edu/publications/rabia-be...,,,,,checked,Faculty,yes
279,485895.0,"Engstrom, David Freeman","Engstrom, Nora Freeman",,Rethinking the Lawyers' Monopoly: Access to Ju...,"Book, Whole","Editor,Writer",2025-09,2025,Published,...,,,https://law.stanford.edu/publications/rethinki...,,,,,checked,Faculty,no
280,501086.0,"Engstrom, David Freeman","Engstrom, Nora Freeman",,Envisioning the Future of Legal Services,"Book, Section",Writer,2025-09,2025,Published,...,,,https://law.stanford.edu/publications/envision...,,,,,checked,Faculty,yes
284,485899.0,"Engstrom, David Freeman","Engstrom, Nora Freeman","Gelbach, Jonah B.; Peters, Austin; Wen, Garrett",Shedding Light on Secret Settlements: An Empir...,Journal Article,Writer,2025-01,2025,Published,...,,https://lawreview.uchicago.edu/print-archive/s...,https://law.stanford.edu/publications/shedding...,,,,,checked,Faculty,no
296,384899.0,"Engstrom, David Freeman","Engstrom, Nora Freeman",,Legal Tech and the Litigation Playing Field,"Book, Section",Writer,2023,2023,Published,...,,,https://law.stanford.edu/publications/legal-te...,https://doi.org/10.1017/9781009255301.009,,,,checked,Faculty,yes
322,501086.0,"Engstrom, Nora Freeman","Engstrom, David Freeman",,Envisioning the Future of Legal Services,"Book, Section",Writer,2025-09,2025,Published,...,,,https://law.stanford.edu/publications/envision...,,,,,checked,Faculty,no
325,485895.0,"Engstrom, Nora Freeman","Engstrom, David Freeman",,Rethinking the Lawyers' Monopoly: Access to Ju...,"Book, Whole","Editor,Writer",2025-09,2025,Published,...,,,https://law.stanford.edu/publications/rethinki...,,,,,checked,Faculty,yes
327,485899.0,"Engstrom, Nora Freeman","Engstrom, David Freeman","Gelbach, Jonah B.; Peters, Austin; Wen, Garrett",Shedding Light on Secret Settlements: An Empir...,Journal Article,Writer,2025-01,2025,Published,...,,https://lawreview.uchicago.edu/print-archive/s...,https://law.stanford.edu/publications/shedding...,,,,,checked,Faculty,yes
343,384899.0,"Engstrom, Nora Freeman","Engstrom, David Freeman",,Legal Tech and the Litigation Playing Field,"Book, Section",Writer,2023-02-20,2023,Published,...,,,https://law.stanford.edu/publications/legal-te...,https://doi.org/10.1017/9781009255301.009,,,,checked,Faculty,no


In [63]:
# 1) Drop rows marked for removal
df_clean.drop(df_clean.index[df_clean["remove"] == "yes"], inplace=True)

# 2) Total number of rows after removal
total_rows = len(df_clean)
print("Total rows after dedup:", total_rows)

# 3) Count duplicate IDs that still remain
id_counts = df_clean["ID"].value_counts()
dup_ids = id_counts[id_counts > 1].index
num_dup_ids = len(dup_ids)

print("Number of IDs that still have duplicates:", num_dup_ids)

Total rows after dedup: 1787
Number of IDs that still have duplicates: 0


In [64]:
# Check for duplicates in other columns
cols_to_check = ["Title", "Citation", "Source Link(s)", "Stanford Link"]

for col in cols_to_check:
    dup_rows = df_clean[df_clean.duplicated(subset=[col], keep=False) & df_clean[col].notna()]

    num_dup_values = dup_rows[col].nunique()
    num_dup_rows = dup_rows.shape[0]

    print(f"\n=== Column: {col} ===")
    print(f"Number of distinct duplicated values: {num_dup_values}")
    print(f"Number of rows involved in those duplicates: {num_dup_rows}")

    if num_dup_rows > 0:
        print("Sample duplicate groups:")
        display(
            dup_rows
            .sort_values(col)
            [[col, "ID", "Faculty Contributor", "Year"]]
            .head(10)
        )


=== Column: Title ===
Number of distinct duplicated values: 67
Number of rows involved in those duplicates: 145
Sample duplicate groups:


Unnamed: 0,Title,ID,Faculty Contributor,Year
811,A Pro-Feminist Life: Sherry Colb and Abortion ...,500801.0,"Karlan, Pamela S.",2024
824,A Pro-Feminist Life: Sherry Colb and Abortion ...,449820.0,"Karlan, Pamela S.",2023
922,A Sober Look at SPACs,403738.0,"Klausner, Michael",2022
932,A Sober Look at SPACs,404415.0,"Klausner, Michael",2020
929,A Sober Look at SPACs,383968.0,"Klausner, Michael",2021
638,Abandoned and Split But Never Reversed: Borak ...,449660.0,"Grundfest, Joseph A.",2023
639,Abandoned and Split But Never Reversed: Borak ...,497821.0,"Grundfest, Joseph A.",2023
640,Abandoned and Split But Never Reversed: Borak ...,448620.0,"Grundfest, Joseph A.",2022
991,Abandoning Trade Secrets,329348.0,"Lemley, Mark A.",2021
982,Abandoning Trade Secrets,404579.0,"Lemley, Mark A.",2022



=== Column: Citation ===
Number of distinct duplicated values: 8
Number of rows involved in those duplicates: 16
Sample duplicate groups:


Unnamed: 0,Citation,ID,Faculty Contributor,Year
571,"Amander Clark, Eric Topol, Hank Greely, Salim ...",514279.0,"Greely, Henry T.",2024
573,"Amander Clark, Eric Topol, Hank Greely, Salim ...",514272.0,"Greely, Henry T.",2024
1797,"Barton H. Thompson, Jr., Liquid Asset: How Bus...",445849.0,"Thompson, Barton H. ""Buzz""",2023
1798,"Barton H. Thompson, Jr., Liquid Asset: How Bus...",448073.0,"Thompson, Barton H. ""Buzz""",2023
1277,"Bernadette Meyler, Bernadette Meyler Staging t...",447842.0,"Meyler, Bernadette",2023
1280,"Bernadette Meyler, Bernadette Meyler Staging t...",407098.0,"Meyler, Bernadette",2022
379,"George Fisher, Beware Euphoria: The Moral Root...",496954.0,"Fisher, George",2024
380,"George Fisher, Beware Euphoria: The Moral Root...",448342.0,"Fisher, George",2023
1576,"Kathleen G. Noonan, Jonathan C. Lipson & Willi...",285902.0,"Simon, William",2019
1577,"Kathleen G. Noonan, Jonathan C. Lipson & Willi...",,"Simon, William",2019



=== Column: Source Link(s) ===
Number of distinct duplicated values: 17
Number of rows involved in those duplicates: 35
Sample duplicate groups:


Unnamed: 0,Source Link(s),ID,Faculty Contributor,Year
1063,https://books.google.com/books/about/Deliberat...,308215.0,"MacCoun, Robert J.",2021
1062,https://books.google.com/books/about/Deliberat...,308213.0,"MacCoun, Robert J.",2021
984,https://clause8publishing.com/ipnta,403388.0,"Lemley, Mark A.",2022
988,https://clause8publishing.com/ipnta,403225.0,"Lemley, Mark A.",2022
1035,https://searchworks.stanford.edu/view/13180472,285505.0,"Lemley, Mark A.",2019
1033,https://searchworks.stanford.edu/view/13180472,285507.0,"Lemley, Mark A.",2019
1026,https://searchworks.stanford.edu/view/13180472,285489.0,"Lemley, Mark A.",2019
462,https://searchworks.stanford.edu/view/13736857,383918.0,"Friedman, Lawrence M.",2021
475,https://searchworks.stanford.edu/view/13736857,383916.0,"Friedman, Lawrence M.",2020
90,https://stanfordmag.org/contents/what-should-f...,283575.0,"Banks, Ralph Richard",2019



=== Column: Stanford Link ===
Number of distinct duplicated values: 1
Number of rows involved in those duplicates: 2
Sample duplicate groups:


Unnamed: 0,Stanford Link,ID,Faculty Contributor,Year
1576,https://law.stanford.edu/publications/reformin...,285902.0,"Simon, William",2019
1577,https://law.stanford.edu/publications/reformin...,,"Simon, William",2019
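
The last table suggests that the one row without an ID (row 1577) shares its Stanford Link with ID 285902. One possible way to recover such an ID is to copy it from a row with the same link. The sketch below uses a toy frame with placeholder URLs; it is an assumption about how the gap could be handled, not a step this notebook performs.

```python
import pandas as pd

# Toy frame: the second row has no ID but shares a link with the first (placeholder URLs)
toy = pd.DataFrame(
    {
        "ID": [285902.0, None, 300001.0],
        "Stanford Link": [
            "https://example.org/pub/a",
            "https://example.org/pub/a",
            "https://example.org/pub/b",
        ],
    }
)

# Map each link to the first non-null ID found among the rows that carry it
link_to_id = toy.dropna(subset=["ID"]).groupby("Stanford Link")["ID"].first()

# Fill only the rows where the ID is missing and a link is available
missing = toy["ID"].isna() & toy["Stanford Link"].notna()
toy.loc[missing, "ID"] = toy.loc[missing, "Stanford Link"].map(link_to_id)
```

After the backfill, the recovered row shares an ID with an existing row, so it would pass through the same ID deduplication rules as any other duplicate.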


In [14]:
df["Italics"].equals(df["Corrected"])  # True means Italics duplicates Corrected, so the Italics column can be removed

True

In [71]:
print("Shape (rows, columns):", df_clean.shape)

Shape (rows, columns): (1787, 32)


In [73]:
print(df_clean.columns.tolist())

['ID', 'Faculty Contributor', 'Faculty Co-Authors/Editors', 'Co-Authors', 'Title', 'Publication Type', 'Contribution', 'Date', 'Year', 'Status', 'Serial Title', 'Publisher', 'Publication Title', 'Editor(s)', 'Citation', 'Italics', 'Volume', 'Issue', 'Pages', 'Edition', 'ISBN', 'ISSN', 'Abstract', 'Source Link(s)', 'Stanford Link', 'DOI', 'PURL', 'SSRN', 'In ORCID', 'Corrected', 'Designation', 'remove']


## Data Curation using Crossref
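
The matching step in this section accepts a Crossref candidate only when its title is similar enough to the local title, using `difflib.SequenceMatcher` with a minimum ratio of 0.8. The sketch below shows how that ratio behaves on two titles from this dataset; `title_similarity` is a hypothetical helper introduced only for this illustration.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two titles, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# Same title with different casing and trailing whitespace
same = title_similarity("A Sober Look at SPACs", "a sober look at SPACs ")

# Two unrelated titles
different = title_similarity("A Sober Look at SPACs", "Abandoning Trade Secrets")

# Apply the same 0.8 threshold used as min_ratio in this section
accepted = [score >= 0.8 for score in (same, different)]
```

The ratio measures character-level overlap only, so a distinct work with a near-identical title can still pass the threshold; restricting the query by author and year helps reduce that risk.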

In [65]:
from difflib import SequenceMatcher

TARGET_COLS = [
    "Source Link(s)",
    "DOI",
    "Volume",
    "Issue",
    "Pages",
    "Edition",
    "ISBN",
    "ISSN",
]

CROSSREF_API_BASE = "https://api.crossref.org/works"


def query_crossref_by_title(title, author=None, year=None, rows=5):
    """
    Query Crossref using title (+ optional author/year).
    Returns a list of candidate items (possibly empty).
    """
    if not isinstance(title, str) or not title.strip():
        return []

    params = {
        "query.title": title,
        "rows": rows,
    }

    if isinstance(author, str) and author.strip():
        params["query.author"] = author

    # Skip the date filter when the year is missing (None or NaN)
    if year is not None and not pd.isna(year):
        y = int(year)
        params["filter"] = f"from-pub-date:{y}-01-01,until-pub-date:{y}-12-31"

    try:
        r = requests.get(CROSSREF_API_BASE, params=params, timeout=10)
        r.raise_for_status()
    except requests.RequestException as e:
        print("Crossref request error:", e)
        return []

    data = r.json()
    return data.get("message", {}).get("items", [])


def pick_best_crossref_match(title, items, min_ratio=0.8):
    """
    Pick the Crossref item whose title best matches the given title.
    Returns (best_item, best_score) where best_item may be None.
    """
    if not items:
        return None, 0.0

    title_norm = title.strip().lower()
    best_item = None
    best_score = 0.0

    for item in items:
        item_titles = item.get("title", []) or []
        if not item_titles:
            continue
        item_title = item_titles[0].strip().lower()
        ratio = SequenceMatcher(None, title_norm, item_title).ratio()
        if ratio > best_score:
            best_score = ratio
            best_item = item

    if best_score < min_ratio:
        return None, best_score

    return best_item, best_score


def extract_metadata_from_crossref(item):
    """
    Extract fields and returns a dict keyed by our column names.
    """
    if item is None:
        return {}

    updates = {}

    doi = item.get("DOI")
    if doi:
        updates["DOI"] = doi

    url = item.get("URL")
    if url:
        updates["Source Link(s)"] = url

    vol = item.get("volume")
    if vol:
        updates["Volume"] = str(vol)

    issue = item.get("issue")
    if issue:
        updates["Issue"] = str(issue)

    pages = item.get("page")
    if pages:
        updates["Pages"] = str(pages)

    issn_list = item.get("ISSN", []) or []
    if issn_list:
        updates["ISSN"] = issn_list[0]

    isbn_list = item.get("ISBN", []) or []
    if isbn_list:
        updates["ISBN"] = isbn_list[0]

    edition = item.get("edition-number") or item.get("edition")
    if edition:
        updates["Edition"] = str(edition)


    return updates
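
To make the `min_ratio=0.8` threshold in `pick_best_crossref_match` more concrete, here is a small illustrative sketch. The helper name `title_similarity` and the second sample title are assumptions made for illustration; the helper applies the same `strip().lower()` normalization and `SequenceMatcher` ratio as the function above.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Same normalization and ratio as pick_best_crossref_match (illustrative helper)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


ours = "Akhil Amar's Unusable Past"

# Case and surrounding whitespace are normalized away, so this is an exact match.
same = title_similarity(ours, "  AKHIL AMAR'S UNUSABLE PAST ")

# An unrelated, much longer title cannot reach the 0.8 threshold.
different = title_similarity(ours, "Deliberative Democracy and Jury Decision Making")

print(f"same={same:.2f}, different={different:.2f}")
```

The ratio is `2 * M / T`, where `M` is the number of matching characters and `T` is the combined length of both strings. One consequence is that a Crossref title that omits or adds a long subtitle may score below 0.8 even when it refers to the same work, which could explain some of the rows that are skipped.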

In [78]:
df_enriched = df_clean.copy()

# make target cols string dtype
for col in TARGET_COLS:
    if col in df_enriched.columns:
        df_enriched[col] = df_enriched[col].astype("string")

df_enriched["Flagged"] = "no"

# 1) Full set of rows that have at least one missing target field
rows_to_enrich = df_enriched[df_enriched[TARGET_COLS].isna().any(axis=1)]
print("Rows with at least one missing target field:", rows_to_enrich.shape[0])

# 2) Subset of 100 rows we will actually hit Crossref with
rows_subset = rows_to_enrich.head(100).copy()
print("Rows to enrich now:", rows_subset.shape[0])

enriched = 0
enriched_indices = []

for idx, row in rows_subset.iterrows():
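    title = row["Title"]
    # Guard against missing values so one bad row does not stop the loop
    year = int(row["Year"]) if pd.notna(row["Year"]) else None
    faculty = row["Faculty Contributor"]  # "Lastname, Firstname"
    author = faculty.split(",")[0].strip() if isinstance(faculty, str) else None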

    items = query_crossref_by_title(title, author=author, year=year, rows=5)
    best_item, score = pick_best_crossref_match(title, items, min_ratio=0.8)

    if best_item is None:
        continue

    updates = extract_metadata_from_crossref(best_item)
    if not updates:
        continue

    row_enriched = False

    for col, val in updates.items():
        if col not in df_enriched.columns:
            continue

        current = df_enriched.at[idx, col]
        current_str = None if pd.isna(current) else str(current).strip()

        if current_str is None or current_str == "":
            df_enriched.at[idx, col] = None if val is None else str(val)
            row_enriched = True

    if row_enriched:
        enriched += 1
        enriched_indices.append(idx)

print("Rows successfully enriched from Crossref from 100 subset", enriched)

Rows with at least one missing target field: 1787
Rows to enrich now: 100
Rows successfully enriched from Crossref from 100 subset 23


Only the first 100 rows were sent to Crossref because of rate limits on the free API. At least one missing field was filled for 23 of those 100 rows (23%).
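
One way to work within the rate limits is to retry failed calls with an increasing delay. The sketch below is an assumption about how that could be added, not something that was run in this notebook: `call_with_backoff`, `max_retries`, and `base_delay` are hypothetical names, and the wrapped function is any callable that raises on failure.

```python
import time


def call_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """
    Call fn() and retry with exponential backoff if it raises.
    Hypothetical helper: names and defaults are illustrative assumptions.
    Returns fn()'s result, or re-raises the last error after max_retries retries.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * (2 ** attempt))


# Demo with a stand-in for a flaky API call (no network access needed).
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 429 Too Many Requests")
    return {"status": "ok"}


waits = []
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

In the notebook, the wrapped callable could be the raw `requests.get(...)` call followed by `raise_for_status()`. Crossref also documents a `mailto` query parameter that routes requests to its "polite" pool, so adding it to `params` in `query_crossref_by_title` may help as well.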

The next cell compares the same rows before and after enrichment.

In [95]:

cols_to_check = [
    "Title", "Year", "Faculty Contributor",
    "DOI", "ISSN", "Volume", "Issue", "Pages", "Source Link(s)"
]

if enriched_indices:
    sample_indices = enriched_indices[:10]

    with pd.option_context("display.max_colwidth", None,
                           "display.width", None,
                           "display.max_columns", None):
        print("Before (df_clean) – sample of enriched rows:")
        display(df_clean.loc[sample_indices, cols_to_check])

        print("After (df_enriched) – same sample:")
        display(df_enriched.loc[sample_indices, cols_to_check])
else:
    print("No rows were enriched in this sample.")

Before (df_clean) – sample of enriched rows:


Unnamed: 0,Title,Year,Faculty Contributor,DOI,ISSN,Volume,Issue,Pages,Source Link(s)
2,The Original Meaning of Commerce in the Indian Commerce Clause,2024,"Ablavsky, Gregory",,,56.0,,1013,
4,Akhil Amar's Unusable Past,2023,"Ablavsky, Gregory",,,121.0,,1119,
6,Too Much History: Castro-Huerta and the Problem of Change in Indian Law,2023,"Ablavsky, Gregory",,,2022.0,,293,
11,"Beyond the Indian Commerce Clause: Robert Natelson's Problematic ""Cite Check""",2022,"Ablavsky, Gregory",,,,,,
18,Credit Nation: Property Laws and Institutions in Early America,2021,"Ablavsky, Gregory",,,61.0,,340,
20,Murder in the Shenandoah: Making Law Sovereign in Revolutionary Virginia,2020,"Ablavsky, Gregory",,,40.0,,752,https://muse.jhu.edu/article/772960
21,Of One Mind and of One Government: The Rise and Fall of the Creek Nation in the Early Republic,2020,"Ablavsky, Gregory",,,86.0,,143,https://muse.jhu.edu/article/748749
27,"Species of Sovereignty: Native Nationhood, the United States, and International Law, 1783-1795",2019,"Ablavsky, Gregory",,,106.0,3.0,591–613,https://academic.oup.com/jah/article/106/3/591/5628951
35,Selling Patents to Indian Tribes to Delay the Market Entry of Generic Drugs,2018,"Ablavsky, Gregory",,,178.0,,179,https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2666791
38,Racism,2025,"Achiume, E. Tendayi",,,,,,


After (df_enriched) – same sample:


Unnamed: 0,Title,Year,Faculty Contributor,DOI,ISSN,Volume,Issue,Pages,Source Link(s)
2,The Original Meaning of Commerce in the Indian Commerce Clause,2024,"Ablavsky, Gregory",10.2139/ssrn.4911164,1556-5068,56.0,,1013,https://doi.org/10.2139/ssrn.4911164
4,Akhil Amar's Unusable Past,2023,"Ablavsky, Gregory",10.36644/mlr.121.6.akhil,1939-8557,121.0,121.6,1119,https://doi.org/10.36644/mlr.121.6.akhil
6,Too Much History: Castro-Huerta and the Problem of Change in Indian Law,2023,"Ablavsky, Gregory",10.1086/724831,0081-9557,2022.0,,293,https://doi.org/10.1086/724831
11,"Beyond the Indian Commerce Clause: Robert Natelson's Problematic ""Cite Check""",2022,"Ablavsky, Gregory",10.2139/ssrn.4244353,1556-5068,,,,https://doi.org/10.2139/ssrn.4244353
18,Credit Nation: Property Laws and Institutions in Early America,2021,"Ablavsky, Gregory",10.1093/ajlh/njab014,0002-9319,61.0,3.0,340,https://doi.org/10.1093/ajlh/njab014
20,Murder in the Shenandoah: Making Law Sovereign in Revolutionary Virginia,2020,"Ablavsky, Gregory",10.1353/jer.2020.0109,1553-0620,40.0,4.0,752,https://muse.jhu.edu/article/772960
21,Of One Mind and of One Government: The Rise and Fall of the Creek Nation in the Early Republic,2020,"Ablavsky, Gregory",10.1353/soh.2020.0046,2325-6893,86.0,1.0,143,https://muse.jhu.edu/article/748749
27,"Species of Sovereignty: Native Nationhood, the United States, and International Law, 1783-1795",2019,"Ablavsky, Gregory",10.1093/jahist/jaz503,0021-8723,106.0,3.0,591–613,https://academic.oup.com/jah/article/106/3/591/5628951
35,Selling Patents to Indian Tribes to Delay the Market Entry of Generic Drugs,2018,"Ablavsky, Gregory",10.1001/jamainternmed.2017.7463,2168-6106,178.0,2.0,179,https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2666791
38,Racism,2025,"Achiume, E. Tendayi",10.4337/9781802204155.00078,,,,423-428,https://doi.org/10.4337/9781802204155.00078


In [74]:
print("Shape (rows, columns):", df_enriched.shape)

Shape (rows, columns): (1787, 33)


## Bluebook Citation Suggestor

An LLM flags potential formatting issues in each citation and suggests a Bluebook-style rewrite.

The suggestions still require careful manual review.

A dedicated third-party citation tool would be preferable if one is available and its license permits this use.

In [83]:
def analyze_citation_bluebook(citation: str):
    """
    For a given citation string, ask an LLM:
    - What (if any) issues it has w.r.t. Bluebook style.
    - A suggested Bluebook-style citation.

    Returns (issues_text_or_None, suggested_citation_or_None).
    """
    if pd.isna(citation) or not str(citation).strip():
        return None, None

    citation_str = str(citation).strip()

    system_msg = (
        "You are an expert in Bluebook legal citation format. "
        "Given a citation string, identify any issues with Bluebook compliance "
        "(abbreviations, ordering, punctuation, missing fields, etc.). "
        "Then provide a corrected Bluebook-style citation. "
        "Do NOT invent new works; use only the information present in the input."
    )

    user_prompt = (
        "Input citation:\n\n"
        f"{citation_str}\n\n"
        "Return ONLY JSON using this schema:\n"
        "{\n"
        '  "issues": ["short issue description 1", "short issue description 2", ...],\n'
        '  "suggested_citation": "the citation rewritten in Bluebook style"\n'
        "}\n"
        "If the citation already looks acceptable in Bluebook, return an empty list for "
        '"issues" and you may return the original citation as suggested_citation.\n'
    )

    resp = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_prompt},
        ]
    )

    content = resp.choices[0].message.content

    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return "LLM response not valid JSON", None

    issues = data.get("issues", [])
    # `or ""` also covers an explicit null returned for suggested_citation
    suggested = (data.get("suggested_citation") or "").strip() or None

    if not issues:
        return None, None  # no issues, no suggested-citation override

    issues_text = "; ".join(issues)
    return issues_text, suggested
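
The `json.loads` call above fails whenever the model wraps its answer in a Markdown code fence or adds text around the JSON. A more tolerant parser is one possible mitigation. The helper below, `parse_llm_json`, is a hypothetical name and only a sketch: it takes the span between the first `{` and the last `}` and tries to parse that.

```python
import json


def parse_llm_json(content):
    """
    Best-effort parse of a JSON object from an LLM reply (illustrative helper).
    Returns a dict, or None if no JSON object can be recovered.
    """
    if not isinstance(content, str):
        return None
    start = content.find("{")
    end = content.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(content[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Simulate a reply wrapped in a Markdown fence (built with "`" * 3 for readability).
fence = "`" * 3
fenced = fence + 'json\n{"issues": [], "suggested_citation": "x"}\n' + fence

print(parse_llm_json(fenced))
print(parse_llm_json("no json here"))
```

If the model in use supports it, the `response_format={"type": "json_object"}` option of the chat completions API is another way to reduce malformed replies.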

In [96]:

df_enriched["Issue in Citation"] = pd.NA
df_enriched["Suggested Bluebook Citation"] = pd.NA

subset = df_enriched[df_enriched["Citation"].notna()].head(5)

for idx, cit in subset["Citation"].items():
    issues, suggestion = analyze_citation_bluebook(cit)

    # Column 1: 'Issue in Citation' – either issue text or None
    df_enriched.at[idx, "Issue in Citation"] = issues

    # Column 2: 'Suggested Bluebook Citation'
    # Only fill when there is an issue
    if issues is not None and suggestion is not None:
        df_enriched.at[idx, "Suggested Bluebook Citation"] = suggestion
    else:
        # Already acceptable → leave blank / NA
        df_enriched.at[idx, "Suggested Bluebook Citation"] = pd.NA

# Quick check: show only rows where an issue was found
with pd.option_context("display.max_colwidth", None,
                       "display.width", None,
                       "display.max_columns", None):
    display(
        df_enriched[
            df_enriched["Issue in Citation"].notna()
        ][["Citation", "Issue in Citation", "Suggested Bluebook Citation"]]
        .head(5)
    )

Unnamed: 0,Citation,Issue in Citation,Suggested Bluebook Citation
0,"Gregory Ablavsky, Speculation Nation: Land Mania in the Revolutionary American Republic, 129 Aᴍ. Hɪꜱᴛ. Rᴇᴠ. 1837 (2024) (book review).","Journal abbreviation uses special/small-cap characters; use standard Bluebook abbreviation 'Am. Hist. Rev.'; Book title should be italicized in Bluebook formatting; Possible misattribution: citation lists Gregory Ablavsky as the author of the review; if this is a review of Ablavsky's book, the reviewer's name (not the book author) should appear","Gregory Ablavsky, Speculation Nation: Land Mania in the Revolutionary American Republic, 129 Am. Hist. Rev. 1837 (2024) (book review)."
1,"Fᴇʟɪx S. Cᴏʜᴇɴ, Cᴏʜᴇɴ'ꜱ Hᴀɴᴅʙᴏᴏᴋ ᴏꜰ Fᴇᴅᴇʀᴀʟ Iɴᴅɪᴀɴᴀ Lᴀᴡ (Gregory Ablavsky, Bethany R. Berger & Maggie Blackhawk eds., 2024).","Uses stylized Unicode small caps and special characters instead of ordinary letters; Uses a nonstandard/special-character apostrophe in ""Cᴏʜᴇɴ'ꜱ""; Unnecessary comma after ""eds."" before the year (Bluebook style uses e.g., ""eds. 2024"")","Felix S. Cohen, Cohen's Handbook of Federal Indian Law (Gregory Ablavsky, Bethany R. Berger & Maggie Blackhawk eds. 2024)."
2,"Gregory Ablavsky, The Original Meaning of Commerce in the Indian Commerce Clause, 56 Cᴏɴɴ. L. Rᴇᴠ. 1013 (2024).","Journal abbreviation uses nonstandard/special characters (e.g., 'Cᴏɴɴ. L. Rᴇᴠ.'); Use standard Bluebook abbreviation: 'Conn. L. Rev.'","Gregory Ablavsky, The Original Meaning of Commerce in the Indian Commerce Clause, 56 Conn. L. Rev. 1013 (2024)."
3,"Gregory Ablavsky, Clarence Thomas Went After My Work. His Criticisms Reveal a Disturbing Fact About Originalism, Sʟᴀᴛᴇ (June 20, 2023), https://slate.com/news-and-politics/2023/06/clarence-thomas-indian-law-originalism-history.html.",Periodical title uses Unicode small-caps 'Sʟᴀᴛᴇ' rather than normal 'Slate'.; Bluebook calls for the periodical title to be presented (typically in italic type) — here it appears plain/special-case formatted.; Trailing punctuation after the URL (final period) can interfere with link copying; Bluebook citations with URLs typically do not include a terminal period.,"Gregory Ablavsky, Clarence Thomas Went After My Work. His Criticisms Reveal a Disturbing Fact About Originalism, Slate (June 20, 2023), https://slate.com/news-and-politics/2023/06/clarence-thomas-indian-law-originalism-history.html"
4,"Gregory Ablavsky, Akhil Amar's Unusable Past, 121 Mɪᴄʜ. L. Rᴇᴠ. 1119 (2023) (book review).",Journal title uses nonstandard/special-character capitalization (Mɪᴄʜ. L. Rᴇᴠ.) instead of the Bluebook abbreviation; Journal abbreviation should be rendered as 'Mich. L. Rev.' with standard periods and spacing,"Gregory Ablavsky, Akhil Amar's Unusable Past, 121 Mich. L. Rev. 1119 (2023) (book review)."


This was run on only a few samples to keep API costs low and to avoid occasional Colab timeouts. It can be run on the full dataset later.
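
Most of the issues flagged in the sample above come from Unicode small-capital letters in the stored citations. A deterministic pre-processing step could remove that noise before any LLM call. The sketch below is an assumption about how to do it: `SMALL_CAPS` and `to_plain_text` are hypothetical names, and the mapping covers only the commonly used small-capital code points. Because it maps to ordinary lowercase letters, it discards the small-caps typeface information, so it is probably best applied to a working copy and not to the stored citation.

```python
# Map Unicode small-capital letters to plain lowercase ASCII (illustrative subset).
SMALL_CAPS = {
    "\u1d00": "a", "\u0299": "b", "\u1d04": "c", "\u1d05": "d", "\u1d07": "e",
    "\ua730": "f", "\u0262": "g", "\u029c": "h", "\u026a": "i", "\u1d0a": "j",
    "\u1d0b": "k", "\u029f": "l", "\u1d0d": "m", "\u0274": "n", "\u1d0f": "o",
    "\u1d18": "p", "\u0280": "r", "\ua731": "s", "\u1d1b": "t", "\u1d1c": "u",
    "\u1d20": "v", "\u1d21": "w", "\u028f": "y", "\u1d22": "z",
}
_TABLE = str.maketrans(SMALL_CAPS)


def to_plain_text(citation: str) -> str:
    """Replace small-capital letters with ordinary lowercase letters."""
    return citation.translate(_TABLE)


# "Am. Hist. Rev." written with small capitals, as in the first sample row.
sample = "129 A\u1d0d. H\u026a\ua731\u1d1b. R\u1d07\u1d20. 1837 (2024)"
print(to_plain_text(sample))  # → 129 Am. Hist. Rev. 1837 (2024)
```

Running the Bluebook check on the normalized text would let the LLM focus on ordering, punctuation, and abbreviation issues instead of repeating the same character-encoding finding for every row.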

## Evaluate Metadata using Citation

This step checks whether the metadata and the citation match and lists potential issues for a reviewer.

An LLM helps here by parsing the citation, comparing it loosely against the metadata, and flagging conflicts.

In [104]:
def evaluate_citation_vs_metadata(row: pd.Series):
    """
    Use LLM to decide if Citation matches key metadata.
    Returns: (matches_metadata: bool, issue_text_or_None)
    """
    citation = row.get("Citation", None)
    if pd.isna(citation) or not str(citation).strip():
        # No citation text at all → treat as not matching
        return False, "Missing citation text"

    payload = {
        "faculty_contributor": str(row.get("Faculty Contributor", "")),
        "Co-Authors": str(row.get("Co-Authors", "")),
        "title": str(row.get("Title", "")),
        "publication_type": str(row.get("Publication Type", "")),
        "year": int(row.get("Year")) if pd.notna(row.get("Year")) else None,
        "volume": str(row.get("Volume", "")) if pd.notna(row.get("Volume", "")) else "",
        "issue": str(row.get("Issue", "")) if pd.notna(row.get("Issue", "")) else "",
        "pages": str(row.get("Pages", "")) if pd.notna(row.get("Pages", "")) else "",
        "edition": str(row.get("Edition", "")) if pd.notna(row.get("Edition", "")) else "",
        "citation": str(citation),
    }

    system_msg = (
        "You are an expert legal bibliographer who understands Bluebook citation rules. "
        "You are given structured metadata for a publication and a citation string. "
        "Your job is ONLY to compare them for internal consistency. "
        "Do NOT look up external information. Do NOT guess missing facts. "
        "Treat abbreviated names/journal titles as matching if they clearly refer to the same person/journal."
    )

    user_prompt = f"""
Here is the metadata for a publication (JSON):

{json.dumps(payload, ensure_ascii=False, indent=2)}

Compare this metadata to the citation field (the 'citation' property).

Check:
- Does the citation appear to describe the same work as the metadata?
- Are there any clear conflicts in:
  * author (faculty_contributor),
  * title,
  * year,
  * volume / issue / pages / edition (when present),
  * publication type (article vs book vs chapter vs journal, etc.)?

Ignore minor punctuation or spacing differences.

Return ONLY JSON in this format:
{{
  "matches_metadata": true or false,
  "issues": ["short explanation of each issue you see, if any"]
}}
"""

    resp = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_prompt},
        ]
    )

    content = resp.choices[0].message.content

    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return False, "LLM response not valid JSON"

    matches = bool(data.get("matches_metadata", False))
    issues_list = data.get("issues", []) or []

    if matches:
        return True, None

    issue_text = "; ".join(issues_list) if issues_list else "Citation and metadata appear inconsistent"
    return False, issue_text

In [105]:

df_enriched["matches_metadata"] = False
df_enriched["metadat_issues"] = pd.NA

subset = df_enriched[df_enriched["Citation"].notna()].head(20)

for idx, row in subset.iterrows():
    matches, issue_text = evaluate_citation_vs_metadata(row)
    df_enriched.at[idx, "matches_metadata"] = matches

    if not matches and issue_text:
        df_enriched.at[idx, "metadat_issues"] = issue_text
    else:
        df_enriched.at[idx, "metadat_issues"] = pd.NA

display(df_enriched[["Faculty Contributor", "Title", "Year", "Citation", "matches_metadata", "metadat_issues"]].head(5))


Unnamed: 0,Faculty Contributor,Title,Year,Citation,matches_metadata,metadat_issues
0,"Ablavsky, Gregory",Speculation Nation: Land Mania in the Revoluti...,2024,"Gregory Ablavsky, Speculation Nation: Land Man...",True,
1,"Ablavsky, Gregory",Cohen's Handbook of Federal Indian Law,2024,"Fᴇʟɪx S. Cᴏʜᴇɴ, Cᴏʜᴇɴ'ꜱ Hᴀɴᴅʙᴏᴏᴋ ᴏꜰ Fᴇᴅᴇʀᴀʟ Iɴ...",False,"Title mismatch: metadata title is ""Cohen's Han..."
2,"Ablavsky, Gregory",The Original Meaning of Commerce in the Indian...,2024,"Gregory Ablavsky, The Original Meaning of Comm...",True,
3,"Ablavsky, Gregory",Clarence Thomas Went After My Work. His Critic...,2023,"Gregory Ablavsky, Clarence Thomas Went After M...",True,
4,"Ablavsky, Gregory",Akhil Amar's Unusable Past,2023,"Gregory Ablavsky, Akhil Amar's Unusable Past, ...",True,


This was tested on only a subset of the data (the first 20 rows that have a citation).

The check is useful for flagging errors introduced during the data collection step.
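
Some metadata conflicts can be caught without an LLM. As a possible complement to the check above, the sketch below tests whether the year, volume, and first page from the metadata literally appear as numbers in the citation string. `quick_precheck` and `_as_token` are hypothetical helpers, and the field handling is an assumption based on the columns shown earlier (for example, a `Volume` value such as `56.0` is reduced to `56`).

```python
import re


def _as_token(value):
    """Turn a metadata value like 56.0, '1013', or '591-613' into a digit token, or None."""
    if value is None:
        return None
    text = str(value).strip()
    if not text or text.lower() in {"nan", "<na>", "none"}:
        return None
    match = re.match(r"\d+", text)
    return match.group(0) if match else None


def quick_precheck(citation, year=None, volume=None, pages=None):
    """
    Return the names of fields whose value does not appear as a whole number
    in the citation. Illustrative only: a literal-match heuristic, not a
    Bluebook parser.
    """
    numbers = set(re.findall(r"\d+", citation or ""))
    missing = []
    for name, value in (("year", year), ("volume", volume), ("pages", pages)):
        token = _as_token(value)
        if token is not None and token not in numbers:
            missing.append(name)
    return missing


cit = "Gregory Ablavsky, Akhil Amar's Unusable Past, 121 Mich. L. Rev. 1119 (2023) (book review)."
print(quick_precheck(cit, year=2023.0, volume="121.0", pages="1119"))  # consistent metadata
print(quick_precheck(cit, year=2022, volume="121.0", pages="1119"))    # wrong year
```

Rows that pass this cheap check could still go to the LLM for author, title, and publication-type comparison, while rows that fail could be sent straight to a reviewer.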