## The Jeffrey Epstein Documents: Named Entity Search

By: Shirsho Dasgupta (2023)

The code reads the unsealed and unredacted court records (released in January 2024) and uses spaCy's named entity recognition models to sift through them and tag names by entity type. This gave Miami Herald reporters a makeshift index of names to consult, which helped them report stories quickly.

The final spreadsheet can be found [here](https://github.com/shirshod/epstein_records/tree/main/epstein_index.csv).

#### Notes:
1. This example uses only a fraction (~100) of all of the unsealed court records that were released.
2. Each of the models (large, medium and small) had minor discrepancies in identifying names. To capture as many names as possible, the code runs all three models, combines their results and then drops duplicates that come from the same file.
3. The models classified some people as "ORG" and some organizations as "PERSON".

### Importing libraries

In [1]:
import os
import glob 

import pandas as pd

import fitz 
from PIL import Image
import pytesseract


import spacy
from spacy import displacy
import en_core_web_sm
import en_core_web_md
import en_core_web_lg



### Creating dataframe of records

In [2]:
### creates list of files
files = glob.glob("records/*.pdf")

In [3]:
### creates a blank dataframe to write to
result_df = pd.DataFrame(columns = ["FILE", "TEXT"])

In [4]:
### loop runs through list of files
for file_path in files:
    
    ### stores file name
    file_name = os.path.basename(file_path)

    ### try-except block handles exceptions
    try:
        
        ### opens each file
        pdf_document = fitz.open(file_path)
        ### initializes variable to write into
        text = ""
        
        ### loop runs through number of pages in the file
        for page_num in range(pdf_document.page_count):
            
            ### stores the contents of the page
            page = pdf_document[page_num]
            ### appends the text from that page
            text += page.get_text()
            ### removes newline characters
            text = text.replace("\n", "")
            
    ### prints error message and moves on to next file for exception
    except Exception as e:
        print(f"Error processing {file_name}: {e}")
        continue

    ### writes the results into a dataframe
    result_df = pd.concat([result_df, pd.DataFrame([{"FILE": file_name, "TEXT": text}])], ignore_index=True)



In [5]:
### displays results
result_df

Unnamed: 0,FILE,TEXT
0,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,United States District CourtSouthern District ...
1,1_15-cv-07433-LAP-2024-01-04-DN1322-NOTICE_of_...,"January 4, 2024 VIA ECF The Honorab..."
2,1_15-cv-07433-LAP-2024-01-04-DN1322-NOTICE_of_...,EXHIBIT N Case 1:15-cv-07433-LAP Document 13...
3,gov.uscourts.nysd.447706.1325.3_1.pdf,EXHIBIT C Case 1:15-cv-07433-LAP Document 13...
4,6d466fcf-63b9-4419-8c78-4dab83fadaad.pdf,United States District Court Southern Distri...
...,...,...
58,1198-7 DE 258-4.pdf,EXHIBIT 4 (File Under Seal) Case 1:15-cv-...
59,1.pdf,Case 1:15-cv-07433-LAP Document 1090-1 Fil...
60,3.pdf,United States District CourtSouthern District ...
61,2.pdf,UNITED STATES DISTRICT COURT SOUTHERN DISTRICT...
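
The extraction loop deletes newline characters outright, which is why some rows in the output read "District CourtSouthern District": the last word of one line is fused with the first word of the next. Below is a minimal sketch of an alternative, assuming it is acceptable to collapse every run of whitespace into a single space. The helper name `normalize_whitespace` and the sample string are hypothetical and not part of the original notebook.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty strings
    return " ".join(text.split())


# Hypothetical sample resembling the first lines of a court filing
sample = "United States District Court\nSouthern District of New York\n"

# Deleting newlines fuses the words on either side of the line break
fused = sample.replace("\n", "")
# Collapsing whitespace keeps the word boundary
cleaned = normalize_whitespace(sample)

print(fused)
print(cleaned)
```

Keeping word boundaries intact would likely change which entities the models detect, so the counts would not necessarily match the ones shown in this notebook.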


### Running small language model

In [6]:
### creates dataframe to write to
df_sm = pd.DataFrame(columns = ["FILE", "TEXT", "LEMMA", "ENTITY"])
### loads model
nlp = en_core_web_sm.load()

In [7]:
### loop runs through dataframe
for i in range(0, len(result_df)):
    
    ### stores text
    text = result_df["TEXT"][i]
    ### stores file name
    file_name = result_df["FILE"][i]
    
    ### implements model on the text
    doc = nlp(text)
    
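    ### stores the entities and attributes; with the column order used below,
    ### the LEMMA column holds the entity label and the ENTITY column holds the lemma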
    entities = [(file_name, ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
    
    ### creates a dataframe to write into
    df_init = pd.DataFrame(entities, columns = ["FILE", "TEXT", "LEMMA", "ENTITY"])
    ### appends the dataframe created outside the loop
    df_sm = pd.concat([df_sm, df_init], ignore_index = True)

### displays results
df_sm

Unnamed: 0,FILE,TEXT,LEMMA,ENTITY
0,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,United States,GPE,United States
1,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,New York,GPE,New York
2,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Virginia L. Giuffre,PERSON,Virginia L. Giuffre
3,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,15,CARDINAL,15
4,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Ghislaine Maxwell,PERSON,Ghislaine Maxwell
...,...,...,...,...
28251,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,clinton,PERSON,clinton
28252,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,Andrew,PERSON,Andrew
28253,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,monday,DATE,monday
28254,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,2,CARDINAL,2


### Running medium language model

In [8]:
### creates dataframe to write to
df_md = pd.DataFrame(columns = ["FILE", "TEXT", "LEMMA", "ENTITY"])
### loads model
nlp = en_core_web_md.load()



In [9]:
### loop runs through dataframe
for i in range(0, len(result_df)):
    
    ### stores text
    text = result_df["TEXT"][i]
    ### stores file name
    file_name = result_df["FILE"][i]
    
    ### implements model on the text
    doc = nlp(text)
    
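    ### stores the entities and attributes; with the column order used below,
    ### the LEMMA column holds the entity label and the ENTITY column holds the lemma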
    entities = [(file_name, ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
    
    ### creates a dataframe to write into
    df_init = pd.DataFrame(entities, columns = ["FILE", "TEXT", "LEMMA", "ENTITY"])
    ### appends the dataframe created outside the loop
    df_md = pd.concat([df_md, df_init], ignore_index = True)

### displays results
df_md

Unnamed: 0,FILE,TEXT,LEMMA,ENTITY
0,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,United States,GPE,United States
1,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,New York,GPE,New York
2,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Virginia L. Giuffre,PERSON,Virginia L. Giuffre
3,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,15-cv-07433-RWS,CARDINAL,15-cv-07433-rws
4,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Ghislaine Maxwell,PERSON,Ghislaine Maxwell
...,...,...,...,...
40437,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,clinton,PERSON,clinton
40438,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,Andrew,PERSON,Andrew
40439,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,monday,DATE,monday
40440,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,2,CARDINAL,2


### Running large language model

In [10]:
### creates dataframe to write to
df_lg = pd.DataFrame(columns = ["FILE", "TEXT", "LEMMA", "ENTITY"])
### loads model
nlp = en_core_web_lg.load()



In [11]:
### loop runs through dataframe
for i in range(0, len(result_df)):
    
    ### stores text
    text = result_df["TEXT"][i]
    ### stores file name
    file_name = result_df["FILE"][i]
    
    ### implements model on the text
    doc = nlp(text)
    
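    ### stores the entities and attributes; with the column order used below,
    ### the LEMMA column holds the entity label and the ENTITY column holds the lemma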
    entities = [(file_name, ent.text, ent.label_, ent.lemma_) for ent in doc.ents]

    ### creates a dataframe to write into
    df_init = pd.DataFrame(entities, columns = ["FILE", "TEXT", "LEMMA", "ENTITY"])
    ### appends the dataframe created outside the loop
    df_lg = pd.concat([df_lg, df_init], ignore_index = True)

### displays results
df_lg

Unnamed: 0,FILE,TEXT,LEMMA,ENTITY
0,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,United States,GPE,United States
1,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,New York,GPE,New York
2,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Virginia L. Giuffre,PERSON,Virginia L. Giuffre
3,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,15,CARDINAL,15
4,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Ghislaine Maxwell,PERSON,Ghislaine Maxwell
...,...,...,...,...
40992,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,Andrew,PERSON,Andrew
40993,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,monday,DATE,monday
40994,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,DAILY,DATE,DAILY
40995,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,2,CARDINAL,2
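
The small, medium and large cells repeat the same loop with a different model loaded each time. Below is a minimal sketch of how that loop could be factored into one function. The function name `extract_entities`, the `LABEL` column name and the stub `fake_nlp` are hypothetical; the stub stands in for a loaded spaCy pipeline so the block runs without the models installed, and it exposes only the attributes the notebook already uses (`doc.ents`, `ent.text`, `ent.label_`, `ent.lemma_`).

```python
from types import SimpleNamespace


def extract_entities(records, nlp):
    """Return one row per entity for an iterable of (file_name, text) pairs.

    `nlp` is any callable returning an object with an `ents` iterable whose
    items expose `text`, `label_` and `lemma_`, as a spaCy pipeline does.
    """
    rows = []
    for file_name, text in records:
        doc = nlp(text)
        for ent in doc.ents:
            # Keys name each value explicitly, so labels and lemmas cannot be swapped
            rows.append({
                "FILE": file_name,
                "TEXT": ent.text,
                "LABEL": ent.label_,
                "LEMMA": ent.lemma_,
            })
    return rows


def fake_nlp(text):
    """Stand-in for a loaded model: tags one hard-coded placeholder name if present."""
    ents = []
    if "Jane Doe" in text:
        ents.append(SimpleNamespace(text="Jane Doe", label_="PERSON", lemma_="Jane Doe"))
    return SimpleNamespace(ents=ents)


# Invented records for illustration only
records = [("1.pdf", "Jane Doe, Defendant."), ("2.pdf", "No names here.")]
rows = extract_entities(records, fake_nlp)
print(rows)
```

With the real models, a call might look like `extract_entities(zip(result_df["FILE"], result_df["TEXT"]), en_core_web_sm.load())`, and the returned list can be passed to `pd.DataFrame`.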


### Combining models and dropping duplicates from same document

In [12]:
final_df = pd.concat([df_sm, df_md, df_lg], ignore_index = True)

### drops duplicate entries if they are from the same file
final_df = final_df.drop_duplicates(subset = ["FILE", "TEXT"])

### displays results
final_df

Unnamed: 0,FILE,TEXT,LEMMA,ENTITY
0,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,United States,GPE,United States
1,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,New York,GPE,New York
2,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Virginia L. Giuffre,PERSON,Virginia L. Giuffre
3,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,15,CARDINAL,15
4,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Ghislaine Maxwell,PERSON,Ghislaine Maxwell
...,...,...,...,...
109659,2.pdf,University of Utah 383 S. University Street Sa...,ORG,University of Utah 383 S. University Street Sa...
109667,2.pdf,North Andrews Ave.,LOC,North Andrews Ave.
109676,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,"Saturday, January 10, 2015 9:00 AM",DATE,"Saturday, January 10, 2015 9:00 am"
109686,efbaca53-ffb4-4064-98ea-bb5ab995f700.pdf,Philip plse,PERSON,Philip plse
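
By default, `drop_duplicates` keeps the first occurrence of each `FILE`/`TEXT` pair. Because the small model's rows come first in the concatenation, its label is the one that survives when the models disagree about the same text in the same file. The sketch below illustrates this with invented rows; the frames `df_a` and `df_b` are hypothetical stand-ins for two models' output, and `LEMMA` holds the entity label as it does in this notebook.

```python
import pandas as pd

# Invented rows: two models disagree on the label for the same text in the same file
df_a = pd.DataFrame({"FILE": ["1.pdf"], "TEXT": ["Acme"], "LEMMA": ["PERSON"]})
df_b = pd.DataFrame({"FILE": ["1.pdf", "2.pdf"], "TEXT": ["Acme", "Acme"], "LEMMA": ["ORG", "ORG"]})

combined = pd.concat([df_a, df_b], ignore_index=True)

# keep="first" is the default: the row from df_a wins for ("1.pdf", "Acme")
deduped = combined.drop_duplicates(subset=["FILE", "TEXT"])

print(deduped)
```

The surviving rows keep their original index labels, which is why the index in the combined output has gaps.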


In [13]:
### gets value counts for each entity type (stored in the LEMMA column)
final_df["LEMMA"].value_counts()

CARDINAL       9184
ORG            5038
DATE           4121
PERSON         4036
GPE            1244
PRODUCT         501
TIME            395
QUANTITY        384
MONEY           379
WORK_OF_ART     292
NORP            266
LAW             206
PERCENT         202
ORDINAL         194
FAC             189
LOC             129
EVENT            63
LANGUAGE         14
Name: LEMMA, dtype: int64
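
The notes at the top mention that some people were tagged "ORG" and some organizations "PERSON". One way to surface candidates for manual review is to collect the set of labels each text received and flag texts that received both. This is only a sketch, and it assumes the check runs on the combined rows before duplicates are dropped. The function name `conflicting_labels` is hypothetical and the rows are invented for illustration.

```python
from collections import defaultdict


def conflicting_labels(rows, labels=("PERSON", "ORG")):
    """Return the (file, text) pairs that received every label in `labels`."""
    seen = defaultdict(set)
    for file_name, text, label in rows:
        seen[(file_name, text)].add(label)
    wanted = set(labels)
    # Keep only pairs whose label set contains all of the wanted labels
    return {key: found for key, found in seen.items() if wanted <= found}


# Invented rows: (file name, entity text, label assigned by some model)
rows = [
    ("1.pdf", "Jane Doe", "PERSON"),
    ("1.pdf", "Jane Doe", "ORG"),
    ("1.pdf", "Example LLP", "ORG"),
    ("1.pdf", "Example LLP", "ORG"),
]
flagged = conflicting_labels(rows)
print(flagged)
```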

### Finalizing dataframe

In [14]:
filters = ["PERSON", "ORG"]

### filters for persons and organizations
final_ents = final_df[final_df["LEMMA"].isin(filters)].reset_index(drop = True)

### displays results
final_ents

Unnamed: 0,FILE,TEXT,LEMMA,ENTITY
0,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Virginia L. Giuffre,PERSON,Virginia L. Giuffre
1,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Ghislaine Maxwell,PERSON,Ghislaine Maxwell
2,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,SCHILLER & FLEXNER LLP,ORG,SCHILLER & FLEXNER LLP
3,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Sigrid McCawley,PERSON,Sigrid McCawley
4,68d6e23f-9ac8-479f-b0b0-08c91e019e4a.pdf,Pro Hac Vice)Meredith,ORG,Pro Hac vice)meredith
...,...,...,...,...
9069,2.pdf,LLP,ORG,LLP
9070,2.pdf,FL 33301,ORG,FL 33301
9071,2.pdf,S.J. Quinney College of Law,ORG,S.J. Quinney College of Law
9072,2.pdf,University of Utah 383 S. University Street Sa...,ORG,University of Utah 383 S. University Street Sa...


In [15]:
final_ents.to_csv("epstein_index.csv", index = False)
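
Since the spreadsheet serves as a makeshift index of names, a lookup against it might work as follows. This is a minimal sketch that assumes the CSV has the columns written by this notebook. The function name `files_mentioning` is hypothetical, and an in-memory sample with invented rows stands in for `epstein_index.csv`.

```python
import csv
import io


def files_mentioning(csv_file, query):
    """Return sorted file names whose TEXT column contains `query`, ignoring case."""
    needle = query.lower()
    reader = csv.DictReader(csv_file)
    return sorted({row["FILE"] for row in reader if needle in row["TEXT"].lower()})


# Invented rows in the same column layout as the exported index
sample_csv = io.StringIO(
    "FILE,TEXT,LEMMA,ENTITY\n"
    "a.pdf,Jane Doe,PERSON,Jane Doe\n"
    "b.pdf,jane doe,PERSON,jane doe\n"
    "b.pdf,Example LLP,ORG,Example LLP\n"
)
matches = files_mentioning(sample_csv, "Jane Doe")
print(matches)
```

With the real file, the function could be called inside `with open("epstein_index.csv", newline="") as f:`. Matching is case-insensitive because the models return some names in lower case.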