## Code associated with **A digital archive reveals how a funding agency cooperated with academics to support a nascent field of science.**

Contact: spencerhong@u.northwestern.edu

The pipeline runs roughly as follows:
1. Put archival artifacts into Tiramisu
2. Run PDF splitting and file conversion
3. Page stream segmentation task
4. Handwriting extraction
5. Text extraction
6. Entity recognition & disambiguation & redaction

We show below how to achieve each step. If you would like a streamlined, automated pipeline, our next platform as part of the Born Physical, Studied Digitally NSF consortium may be of interest.

You can request the Core Collection of the NHGRI archive by following the data availability section of the manuscript. For this tutorial, we use an example dataset of business documents sourced from the Industry Documents Library under Fair Use.

### Step One - Put archival artifacts into Tiramisu 

Please set the path that contains your archive/corpus that you wish to track and convert. In this example, we will do so by setting `TIRAMISU_ROOT` in [`core/.env`](tiramisu/core/.env) as `../../example_data`. Also set your favorite `NEO4J_PASSWORD` in the same file.
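
Before building, it can help to confirm that both variables are actually set. The snippet below is a minimal sketch, not part of Tiramisu: `read_env` and `missing_keys` are hypothetical helper names, and the sketch assumes the file uses plain `KEY=VALUE` lines.

```python
from pathlib import Path

# the two variables this step asks you to set
REQUIRED_KEYS = ("TIRAMISU_ROOT", "NEO4J_PASSWORD")


def read_env(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(path, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the env file."""
    values = read_env(path)
    return [key for key in required if not values.get(key)]
```

Run from inside the `tiramisu` folder, `missing_keys("core/.env")` should return an empty list once both values are filled in.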

If you are on an AArch64 architecture, please move [`digest-0.1.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl`](tiramisu/core/rust/target/options/digest-0.1.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl) inside [`core/rust/target/options`](tiramisu/core/rust/target/options) to [`core/rust/target/wheels`](tiramisu/core/rust/target/wheels). Check that only one wheel file exists there.

Then build and start Tiramisu inside the `tiramisu` folder with

```bash
docker-compose -f core/docker-compose_aarch64.yaml up --build
```

If you are on an x86_64 architecture, please move [`digest-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl`](tiramisu/core/rust/target/options/digest-0.1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl) to [`core/rust/target/wheels`](tiramisu/core/rust/target/wheels). Check that only one wheel file exists there.

Then build and start Tiramisu inside the `tiramisu` folder with

```bash
docker-compose -f core/docker-compose_x86_64.yaml up --build
```
If you run into LabelStudio permission issues in the logs, check the [FAQs](tiramisu/README.md#FAQs).

The Neo4J service and the frontend will be the last to start up. Once you see

<img src="imgs/neo4j_start.png" width="800" /> 

and 

<img src="imgs/frontend_start.png" width="800" />,   
you're ready to head to http://localhost:8080! It should look like below.


<img src="imgs/tiramisu_actions.png" width="800" />   


Here, you can start the initial step by clicking the "digest" button. Please click only once, as each additional click repeats the same action and clutters your graph database. You can confirm that digestion has finished by checking the task dashboard at http://localhost:8080/flower/dashboard or by going to the graph database at http://localhost:7474 and running the following query:

```cypher

MATCH (n) RETURN n

```

It will look like below.


<img src="imgs/neo4j_digest.png" width="800" />   

These artifacts are now part of Tiramisu! You're ready to start converting and preprocessing for data extraction.

### Step Two - Run PDF splitting and file conversion

Tiramisu already has built-in functions that convert MS Word documents to PDFs, split multi-page PDFs into single-page PDFs (this will later come in handy for page stream segmentation), and convert all PDFs to images (for future OCR & labeling). It also determines the type of each PDF (e.g. born-physical PDFs, which are scanned, or digital-native PDFs) and adds it as metadata.

To do all of the above, simply click the rest of the buttons (e.g. `CONVERT MS TO PDFS`, `SPLIT PDFS`, etc.) **in the order** shown on the Tiramisu Actions page. **Make sure each task is finished before clicking the next task.** When finished, your graph database will look like the following:



<img src="imgs/neo4j_processed.png" width="800" />   

How do you visualize some of these documents? Let's again use Tiramisu.

Let's say we want to look at an artifact with a unique nodeID of `0x3dee1ff7+++0x86a9fb43`, which is page 4 of the processed PDF `ppwl0228.pdf` in `/tiramisu/fossil_fuel` (try to trace back these steps using the graph database!).

First, let's head to http://localhost:8085. This is our labeling interface using LabelStudio. You can create an account (which is stored locally, so no data is transferred online) and then go to `Account & Settings` and copy the API access token.

We can input these parameters in the Tiramisu Actions page under `Visualize a specific PDF or image`. The `api` field is the access token, `nodeID` is the specific nodeID you'd like to visualize, and `configuration` is either `image` to visualize an image or `pdf` to visualize a PDF. When the button is clicked, it will create a labeling task inside LabelStudio, which will look like this:

<img src="imgs/labelstudio.png" width="800" />   

### Step Three - Page stream segmentation task

Page stream segmentation is a computer vision task that attempts to separate combined PDFs into logically separate documents. Such combined PDFs are often an artifact of high-throughput scanning systems and are quite prevalent in born-physical archives.
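
To make the task concrete, here is a minimal sketch of how per-pair boundary decisions translate into documents. It assumes a hypothetical label format (one boolean per pair of consecutive pages, `True` meaning the second page starts a new document) and returns half-open, 0-indexed page ranges, the same shape used later in this step when creating `Document` nodes. The function name `boundaries_to_ranges` is illustrative and not part of Tiramisu.

```python
def boundaries_to_ranges(is_new_document):
    """Convert per-pair boundary labels into half-open page ranges.

    is_new_document[i] is True when page i + 1 starts a new document,
    i.e. the pair (page i, page i + 1) straddles a boundary.
    A file with n pages therefore has n - 1 labels.
    """
    ranges = []
    start = 0
    for i, boundary in enumerate(is_new_document):
        if boundary:
            ranges.append([start, i + 1])
            start = i + 1
    # close the final document at the last page (n = number of labels + 1)
    ranges.append([start, len(is_new_document) + 1])
    return ranges


# four pages, with a boundary between the second and third page
print(boundaries_to_ranges([False, True, False]))  # [[0, 2], [2, 4]]
```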

While we have trained our own page stream segmentation model using synthetic data (contact us for more info), the NHGRI Core Collection was split manually using the following procedure. Upon request, the separated document boundaries are available as secondary data from NHGRI.

First, we must prepare a corpus to be labeled. We can use Tiramisu's API endpoint to query the necessary PDFs to be labeled using LabelStudio.

In [1]:
import json
from urllib import request
import time
import pandas as pd
import requests

In [39]:
## This query does not get PDFs converted from MS documents as they are one logical document per one file
GET_PDFS_ORDERED = """MATCH (e) - [:CONTAINS] -> (c) - [:SPLIT_INTO] -> (n) - [:CONVERT_TO] -> (d)
WHERE n.fileExtension = "pdf"
RETURN c.originalPath as originalPDF, c.nodeID as originalNodeID, n.nodeID as nodeID, n.page as page, d.tiramisuPath as image
ORDER BY originalNodeID ASC, page ASC"""

def return_from_neo4j(query):
	"""This function returns the objects from Tiramisu's graph database given a valid query."""
	
	digest_list = [
	{

		"action": "query_neo4j",
		 'kwargs': {'query': query}
	}]
	
	data = json.dumps({ 
			"action_list": digest_list 
		}).encode()
	
	req = request.Request("http://localhost:8080/api/action/concurrent", data)
	req.add_header("Content-Type", "application/json")
	res = request.urlopen(req)
	out_data = res.read()
	result = json.loads(out_data)
	
	time.sleep(1)
	
	response = request.urlopen("http://localhost:8080/api/status/" + result['task_id'][0])

	finished = False
	while not finished:
		time.sleep(1)

		finished = check_status("http://localhost:8080/api/status/" + result['task_id'][0])
	response = request.urlopen("http://localhost:8080/api/status/" + result['task_id'][0])
	data = json.loads(response.read())
	
	return pd.DataFrame(data['task_result'])

def submit_to_tiramisu(query):
    """This function submits a write query to Tiramisu's graph database and returns the API response."""
    digest_list = [
    {
        "action": "write_neo4j",
         'kwargs': {'query': query}
    }]
    data = json.dumps({ 
            "action_list": digest_list 
        }).encode()
    req = request.Request("http://localhost:8080/api/action/concurrent", data)
    req.add_header("Content-Type", "application/json")
    res = request.urlopen(req)
    out_data = res.read()
    return json.loads(out_data)

def check_status(url):
	"""This function checks the status of the task."""
	r = requests.get(url)
	status = r.json()['task_status']

	return status == 'SUCCESS'

In [13]:
# Let's see all the PDFs ready for page stream segmentation.

results = return_from_neo4j(GET_PDFS_ORDERED)
results.head(3)

| | image | nodeID | originalNodeID | originalPDF | page |
|---|---|---|---|---|---|
| 0 | `/tiramisu/.tiramisu/___tiramisu_versions/0x188...` | `0xf2507a6a+++0x3a05a53a` | `1202194586+++173522494` | `/tiramisu/food/ythh0257.pdf` | 0 |
| 1 | `/tiramisu/.tiramisu/___tiramisu_versions/0x188...` | `0xb2404f64+++0x3a05a53a` | `1202194586+++173522494` | `/tiramisu/food/ythh0257.pdf` | 1 |
| 2 | `/tiramisu/.tiramisu/___tiramisu_versions/0x188...` | `0x7bea55b6+++0x3a05a53a` | `1202194586+++173522494` | `/tiramisu/food/ythh0257.pdf` | 2 |


In [17]:
# Now create valid sequential pairs for labeling.

# Sort by originalNodeID and page
results = results.sort_values(by=['originalNodeID', 'page']).reset_index(drop=True)

        
# Generate overlapping pairs
pairs = []
for i in range(len(results) - 1):
    current_row = results.iloc[i]
    next_row = results.iloc[i + 1]
    
    # If originalNodeID changes, the current page is the last page of its PDF,
    # so add an empty second pair element
    if current_row['originalNodeID'] != next_row['originalNodeID']:
         pairs.append({
            "image1": current_row['image'],
            "image2": None,
            "originalNodeID": current_row['originalNodeID'],
            "page1": current_row['page'],
            "page2": None,
            "originalPDF": current_row['originalPDF']
        })
    else:
         pairs.append({
            "image1": current_row['image'],
            "image2": next_row['image'],
            "originalNodeID": current_row['originalNodeID'],
            "page1": current_row['page'],
            "page2": next_row['page'],
            "originalPDF": current_row['originalPDF']
        })

In [18]:
results.tail(3)

| | image | nodeID | originalNodeID | originalPDF | page |
|---|---|---|---|---|---|
| 41 | `/tiramisu/.tiramisu/___tiramisu_versions/0x3de...` | `0xd0b59d09+++0x1c75e10b` | `565574989+++1900608366` | `/tiramisu/fossil_fuel/ppwl0228.pdf` | 4 |
| 42 | `/tiramisu/.tiramisu/___tiramisu_versions/0x826...` | `0xf24a1fd+++0x1df57eef` | `647597939+++1067243026` | `/tiramisu/chemical/ljfd0346.pdf` | 0 |
| 43 | `/tiramisu/.tiramisu/___tiramisu_versions/0x826...` | `0x253d9781+++0x1df57eef` | `647597939+++1067243026` | `/tiramisu/chemical/ljfd0346.pdf` | 1 |


In [22]:
pairs[4]

{'image1': '/tiramisu/.tiramisu/___tiramisu_versions/0x66efcb5a+++0xc29ab2c5/gqxp0324_page_0.png',
 'image2': '/tiramisu/.tiramisu/___tiramisu_versions/0x66efcb5a+++0x2e84fafa/gqxp0324_page_1.png',
 'originalNodeID': '144429932+++4128447885',
 'page1': 0,
 'page2': 1,
 'originalPDF': '/tiramisu/opioid/gqxp0324.pdf'}
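
If the pairs need to be written to disk as JSONL (one JSON object per line), a minimal sketch follows. The helper name `write_jsonl` and any output filename are illustrative assumptions, not something Tiramisu prescribes.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # default=int converts values json cannot serialize on its own,
            # such as NumPy integers coming from pandas page numbers
            f.write(json.dumps(record, default=int) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl(pairs, "pss_pairs.jsonl")` saves the list built above; `None` values are written as JSON `null`.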

Once the pairs are saved in JSONL format, we can add them to LabelStudio for page stream segmentation labeling. Add the following parameters in the Tiramisu Actions page:

<img src="imgs/parameters.png" width="800" />   

Then, LabelStudio will look like this for your labeling task:

<img src="imgs/labeling.png" width="800" />   

Once labeled, you can update Tiramisu with a new `Document` node type which gets used in downstream analyses.

In [26]:
import zlib

def crc32(data):
	data = bytes(data, 'UTF-8')

	return hex(zlib.crc32(data) & 0xffffffff)  # crc32 returns a signed value, &-ing it will match py3k



In [36]:
# we have a fake labeled dataset to upload to Tiramisu
# candidates are originalPDF paths
candidates = [
    "/tiramisu/food/ythh0257.pdf",
    "/tiramisu/opioid/gqxp0324.pdf"
]

# page ranges use 0-indexed pages with an inclusive left end and an exclusive right end;
# the comments below give the corresponding 1-indexed page numbers
candidate_pages = [
    [[0,2], # pages 1 to 2 of ythh0257.pdf are part of one document
     [2,4]],# pages 3 to 4 of ythh0257.pdf are part of one document        
    [[0,1], # page 1 of gqxp0324.pdf is one document
     [1,6]] # pages 2 to 6 of gqxp0324.pdf are part of one document
]

In [49]:
for doc, path in zip(candidate_pages, candidates):
    for pages in doc:
        
        newID = crc32(path) + "_" + str(pages[0]) + "_" + str(pages[-1]-1)
        print(f"new documentID: {newID}")
        submit_to_tiramisu(f"""
                    MERGE (a:Document {{nodeID: "{newID}"}})   
                    """)
        time.sleep(2) # to ensure that the NEO4J query for document creation runs first
        for i in range(pages[0], pages[-1]):
            submit_to_tiramisu(f"""
                    match (e:Folder) - [:CONTAINS] -> (n:File) - [:SPLIT_INTO] -> (c:File) 
                    where n.fileExtension = "pdf" and n.originalPath = 
                    "{path}" 
                    and c.page = {i} 
                    MATCH (a:Document {{nodeID: "{newID}"}})   
                    CREATE (c) - [:PART_OF] -> (a)
                    """)

new documentID: 0x137f9419_0_1
new documentID: 0x137f9419_2_3
new documentID: 0xf76d7f16_0_0
new documentID: 0xf76d7f16_1_5
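
The ID scheme used above (CRC32 of the original path, then the first and last 0-indexed page of the range) can be wrapped in a small helper that also rejects empty or overlapping ranges before anything is written to the graph. This is a sketch under that reading of the loop; `make_document_id` and `document_ids` are hypothetical names.

```python
import zlib


def make_document_id(original_path, start, end_exclusive):
    """Build '<crc32 of path>_<first page>_<last page>' for a half-open range."""
    checksum = hex(zlib.crc32(original_path.encode("utf-8")) & 0xFFFFFFFF)
    return f"{checksum}_{start}_{end_exclusive - 1}"


def document_ids(path, page_ranges):
    """Return one ID per [start, end) range, rejecting empty or overlapping ranges."""
    ids = []
    previous_end = 0
    for start, end in sorted(page_ranges):
        if start < previous_end or end <= start:
            raise ValueError(f"invalid or overlapping range: [{start}, {end})")
        ids.append(make_document_id(path, start, end))
        previous_end = end
    return ids
```

Calling `document_ids(path, pages)` for each entry of `candidates` and `candidate_pages` before the upload loop surfaces labeling mistakes early.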


Once created, the documents will look like the following:

<img src="imgs/document_creation.png" width="800" />   

You can always undo the document creation step by running the following query:
```cypher
MATCH (n) - [r:PART_OF] -> (c:Document)  DELETE r, c
```
   

### Step Four - Handwriting extraction

Some PDFs have handwriting which must be removed both to increase the accuracy of OCR and to mitigate the potential risk of re-identifying individuals. We have provided all the steps to create synthetic handwriting training data, the training script, and the inference code to remove handwriting all in a separate folder called [`handwriting_extraction`](handwriting_extraction/README.md). 


### Step Five - Text extraction

We can extract text from the scanned PDFs, digital-native PDFs, and Microsoft Office documents. We use Apache Tika and Tesseract OCR.

We first get all the necessary files from Tiramisu for extraction. The requirements for this step are listed in [`requirements.txt`](entity_recognition/requirements.txt).

In [None]:
from tqdm import tqdm
from tika import parser
import pytesseract
from PIL import Image  # needed for Image.open in the OCR cells below
from pathlib import Path

import shutil

TIRAMISU_PATH = ""  # set your local path that contains .tiramisu
all_pdfs = return_from_neo4j("""
match (n:Folder) - [:CONTAINS] -> (e:File) - [:SPLIT_INTO] -> (c:File) - [:CONVERT_TO] -> (f:File) 
where e.fileExtension = 'pdf' and f.fileExtension = 'png' 
return c.nodeID as nodeID, c.tiramisuPath as tiramisu_path, c.name as name, f.tiramisuPath as image_path, c.scanned as scanned
""")
all_pdfs['local_path'] = all_pdfs['tiramisu_path'].apply(lambda x: (Path(TIRAMISU_PATH) / Path(x).relative_to('/tiramisu/')).as_posix())
all_pdfs['image_local_path'] = all_pdfs['image_path'].apply(lambda x: (Path(TIRAMISU_PATH) / Path(x).relative_to('/tiramisu/')).as_posix())

all_ms = return_from_neo4j("""
match (n:Folder) - [:CONTAINS] -> (e:File) 
where e.fileExtension in ['xls', 'xlsx', 'ppt', 'pptx', 'doc', 'docx'] 
return e.nodeID as nodeID, e.tiramisuPath as tiramisu_path, e.name as name, e.fileExtension as file_extension
""")
all_ms['local_path'] = all_ms['tiramisu_path'].apply(lambda x: (Path(TIRAMISU_PATH) / Path(x).relative_to('/tiramisu/')).as_posix())



We assume that you have already run handwriting extraction on the scanned PDFs, which you can select for that step with:
```python
all_pdfs.loc[all_pdfs.scanned == True]
```
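
The remaining cells in this step write into the folders `ms_tika`, `pdf_ocr`, `ms_text`, and `all_text`, and `open(..., "w")` does not create missing folders. A minimal sketch to create them up front, assuming they live relative to the notebook's working directory (`ensure_output_dirs` is a hypothetical helper name):

```python
from pathlib import Path

# folder names taken from the cells in this step
OUTPUT_DIRS = ["ms_tika", "pdf_ocr", "ms_text", "all_text"]


def ensure_output_dirs(base=".", names=OUTPUT_DIRS):
    """Create each output folder under base if it does not exist yet."""
    created = []
    for name in names:
        folder = Path(base) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `ensure_output_dirs()` once before the extraction cells is enough; running it again is harmless.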

We first run Apache Tika on all MS documents, using the Tika server hosted at http://localhost:9998

In [None]:
for i, row in tqdm(all_ms.iterrows(), total = all_ms.shape[0]):
    parsed = parser.from_file(row['local_path'], 'http://localhost:9998/tika')
    with open(f"ms_tika/{row['nodeID']}.json", "w") as f:
        json.dump(parsed, f)

We then OCR the scanned and electronic PDFs. We OCR electronic PDFs because the Core Collection has many digital-native PDFs that contain images (which are not part of the digital content).

In [None]:
for i, row in tqdm(all_pdfs.loc[all_pdfs.scanned == False].iterrows(), total = all_pdfs.loc[all_pdfs.scanned == False].shape[0]):
    config = "-l eng --oem 1 --psm 3  -c preserve_interword_spaces=1"
    nodeID = row['nodeID']
    path = row['image_local_path']
    if not Path("pdf_ocr/" + nodeID + ".txt").exists():
        with open("pdf_ocr/" + nodeID + ".txt", "w") as f:
            f.write(pytesseract.image_to_string(Image.open(path), config = config))
            
# redacted images is from handwriting extraction
for path in tqdm(Path("redacted_images/").glob("*.png"),\
                 total = all_pdfs.loc[all_pdfs.scanned == True].shape[0]):
    nodeID = path.stem
    config = "-l eng --oem 1 --psm 3  -c preserve_interword_spaces=1"
    
    if not Path("pdf_ocr/" + nodeID + ".txt").exists():
        with open("pdf_ocr/" + nodeID + ".txt", "w") as f:
            f.write(pytesseract.image_to_string(Image.open(path.as_posix()), config = config))

We then pool all the text in one folder.

In [None]:
for text in Path("ms_tika/").glob("*.json"):
    with open(text, "r") as f:
        data = json.load(f)

    with open("ms_text/" + text.stem + '.txt', "w" ) as f:
        if data['content'] is not None:
            f.write(data['content'].strip())

# copying the OCR text and the MS text into one folder
for folder in ["pdf_ocr", "ms_text"]:
    for i in Path(folder).glob("*.txt"):
        shutil.copy(i, f"all_text/{i.name}")

### Step Six - Entity recognition & disambiguation & redaction

We use spaCy's entity recognition pipeline to train our own entity recognition models based on NHGRI data. We fine-tune models with small labeled samples of the Core Collection. We have provided the training script, hyperparameters, and dataset creation scripts all in a separate folder called [`entity_recognition`](entity_recognition/README.md).

Once the entities are detected, we can apply entity disambiguation. We use a "seed and expand" approach: we start with a known list of individuals and their aliases, and use fuzzy matching to match detected names against that list. The remaining names are then grouped into separate individuals with their own aliases, again using fuzzy matching.

The process is as follows:

1. generate aliases from a starting list of individuals
2. match all detected names to these aliases
3. match remainder of names to these aliases by edit distance (fuzzy matching)
4. find valid new names that represent new individuals from remainder
5. match valid new names to one another by edit distance (fuzzy matching)
6. match the remainder of names to the valid new names with generated aliases
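
As a toy illustration of steps 2 to 5, the sketch below assigns names to seed individuals first and groups whatever is left. It is an assumption-laden stand-in, not the pipeline itself: it uses the standard library's `difflib` ratio instead of rapidfuzz's `WRatio`, a hypothetical cutoff of 0.85, and it skips alias generation (steps 1 and 6).

```python
from difflib import SequenceMatcher


def similar(a, b, cutoff=0.85):
    """Stand-in scorer: difflib ratio in [0, 1], not rapidfuzz's WRatio."""
    return SequenceMatcher(None, a, b).ratio() >= cutoff


def seed_and_expand(detected, seed_aliases, cutoff=0.85):
    """Map each detected name to a seed individual or to a new group.

    seed_aliases maps alias -> canonical name of a known individual.
    """
    assigned = {}
    leftovers = []
    for name in detected:
        if name in seed_aliases:  # step 2: exact alias match
            assigned[name] = seed_aliases[name]
            continue
        fuzzy = [alias for alias in seed_aliases if similar(name, alias, cutoff)]
        if fuzzy:  # step 3: fuzzy match against the seed aliases
            assigned[name] = seed_aliases[fuzzy[0]]
        else:
            leftovers.append(name)

    groups = []  # steps 4-5: greedily group the remaining names
    for name in leftovers:
        for group in groups:
            if all(similar(name, member, cutoff) for member in group):
                group.append(name)
                break
        else:
            groups.append([name])

    for group in groups:
        canonical = max(group, key=len)  # longest variant names the group
        for name in group:
            assigned[name] = canonical
    return assigned
```

With the seed `{"john smith": "john smith", "j smith": "john smith"}`, a misspelling such as `"john smlth"` is attached to the known individual, while two spellings of an unknown name end up in one new group.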


First create a list of starting individuals at `starting_individuals.txt`. The text file looks like:

    john smith; johnny smith; j. c. smith    
    jane doe; jan doe; jane c. doe

where each line represents a known individual and the `;` separates known aliases.

Also create a list of individuals that are easily missed, perhaps due to an uncommon name structure, at `missed_strings.txt`. The text file looks like:

    peter john david; p. john de david; P. John de David   
    la caprio; lacaprio; Hela LaCaprio
    
where each line represents a known individual and the last element is the name you wish to associate the known aliases with.

In [None]:
import re
import pandas as pd
from unidecode import unidecode

import spacy
spacy.require_gpu()
from collections import defaultdict

from tqdm import tqdm

from rust_utils import _sliding_window

from polyfuzz.models import EditDistance
from rapidfuzz import fuzz

import json


In [None]:
IDs = {}
with open("knowledge_base/starting_individuals.txt", "r",  encoding="utf-8") as f:
    for i, line in enumerate(f):
        IDs[i+1] = [alias.strip() for alias in line.split(";")]
        
# difficult disambiguations (or ones you are sure about) should be added manually
missed = {}
with open("knowledge_base/missed_strings.txt", "r",  encoding="utf-8") as f:
    for line in f:
        lines = [alias.strip() for alias in line.split(";")]
        
        # every alias on the line maps to the last element
        for alias in lines[:-1]:
            missed[alias] = lines[-1]

In [None]:
# remove prefixes/suffixes 
prefixes_suffixes = ["mr.", "mr", "mrs", "mrs.", "dr", "dr.", "phd", "ph.d", "ph.d.", "ms", "ms.", "m.p.h.", "mph", "m.p.h", "mister", "miss", "doctor","frs", "professor", "prof", "prof."]

prefix_suffix_pattern = r'\b(?:' + "|".join(map(re.escape, prefixes_suffixes)) + r')\b'

# normalize all names before creating aliases
def normalize_name(name):

    # a very long name, most likely an error in detecting a name
    if len(name) > 40:
        return ""

    
    # remove all prefix and suffix
    cleaned_name = re.sub(prefix_suffix_pattern, '', unidecode(name.lower()))

    # remove all leading and trailing non-alphabetic characters (commas, periods, spaces)
    cleaned_name = re.sub(r'^[^a-zA-Z]+|[^a-zA-Z]+$', '', cleaned_name)
    
    # remove everything that is not an alphabetic character, whitespace, or a comma
    # this also removes periods, hyphens, and dashes
    cleaned_name = re.sub(r"[^a-zA-Z\s,]" ,'', cleaned_name)

    # if there is a comma left, it's an inverted name
    if "," in cleaned_name:
        
        # if there is a comma, there should only be one as it's usually "lastname, firstname"
        if len(cleaned_name.split(",")) > 2:
            return ""
        else:
            # `first` is the text before the comma (the last name),
            # `last` is the text after the comma (the first name)
            first = cleaned_name.split(",")[0]
            last = cleaned_name.split(",")[-1]

            if len(first) == 1 or len(last) == 1:
                return ""
            else:
                # un-invert to "firstname lastname"
                return last.strip() + " " + first.strip()
    
    return cleaned_name

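# with normalized names, find aliases that match those names
def provide_aliases(name):
 
    aliases = []

    # an empty name (one rejected by normalize_name) has no aliases
    if not name.split():
        return aliases
    
    firstname = name.split()[0]
    lastname = name.split()[-1]
    middlename = " ".join(name.split()[1:-1])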
    
    aliases.append(name)
    
    # first initial lastname
    aliases.append(firstname[0]+ " " + lastname)
    
    # firstname lastname
    aliases.append(firstname + " " + lastname)
    
    # firstname lastname no spaces
    aliases.append(firstname+lastname)

    # first initial lastname no spaces
    aliases.append(firstname[0] + lastname)
    
    # last name, first name
    aliases.append(lastname + ", " + firstname)
    

    # if there is a middle name, find more possible aliases
    if len(middlename) > 0:
        aliases.append(firstname + " " + middlename[0] + " "+ lastname )
        aliases.append(firstname + " " + middlename + lastname)
        aliases.append(firstname[0] + " " + middlename[0] + " " + lastname)
        aliases.append(firstname + middlename[0] + " " + lastname)
        aliases.append(firstname[0]+middlename[0]+lastname)
        aliases.append(firstname[0]+middlename[0]+ " " + lastname)
        aliases.append(lastname + " "+ firstname[0] + middlename[0])
        if len(middlename) > 1:
            aliases.append(middlename + " "+ lastname)
            aliases.append(firstname + " "+ middlename)

    return aliases



In [None]:

alias_counter = {}
for ID, names in tqdm(IDs.items()):
    for i, name in enumerate(names):
        normalized = normalize_name(name)
        if i == 0:
            ID_name = normalized
        if normalized != "":
            if normalized not in alias_counter:
                alias_counter[normalized] = set([ID_name])
            else:
                alias_counter[normalized].add(ID_name)
        for alias in provide_aliases(normalized):
            if alias not in alias_counter:
                alias_counter[alias] = set([ID_name])
            else:
                alias_counter[alias].add(ID_name)

In [None]:
# we load the model from entity recognition step
nlp = spacy.load('../entity_recognition/data_curve/500_samples/runs_1/model-best/')
# load the texts from text extraction step
corpus = Path("all_text").glob("*.txt")

First, detect all individuals in the corpus. We use a sliding window function written in Rust; the source code is provided below. You can use [maturin](https://github.com/PyO3/maturin) to create a Python library.


```rust
use pyo3::prelude::*;

#[pyfunction]
fn _sliding_window(_py: Python, words: Vec<String>, window_size: usize, overlap_size: usize)
    -> PyResult<Vec<Vec<String>>>
{
    let mut result = Vec::new();
    if words.len() <= window_size {
        result.push(words)
    }
    else {
      
    let mut start = 0;

    while start + window_size <= words.len() {  
        let end = start + window_size;  
        let window: Vec<String> = words[start..end].to_vec();   
        result.push(window);   
 
        start += window_size - overlap_size;
    }
    let window: Vec<String> = words[start - overlap_size - overlap_size..words.len()].to_vec();   
    result.push(window);   
}
 Ok(result)
}   

#[pymodule]
fn rust_utils(_py: Python, m: &PyModule) -> PyResult<()> {   
    m.add_function(wrap_pyfunction!(_sliding_window, m)?)?;  
    Ok(())   
}  
```
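
If building the Rust extension is not an option, the same windowing can be sketched in pure Python. The function below mirrors the Rust logic above, including its final window that reaches back two overlaps from the last start position; the input check and the clamp at 0 are additions of this sketch, and it will be slower on large corpora.

```python
def sliding_window(words, window_size, overlap_size):
    """Pure-Python counterpart of the Rust `_sliding_window` above."""
    # added guard: without it the loop below would never advance
    if overlap_size >= window_size:
        raise ValueError("overlap_size must be smaller than window_size")

    if len(words) <= window_size:
        return [list(words)]

    result = []
    start = 0
    while start + window_size <= len(words):
        result.append(list(words[start:start + window_size]))
        start += window_size - overlap_size

    # final window: starts two overlaps before `start`, as in the Rust code
    # (clamped at 0 here, where the Rust version would fail instead)
    tail_start = max(0, start - 2 * overlap_size)
    result.append(list(words[tail_start:]))
    return result
```

It takes the same arguments as the Rust function, for example `sliding_window(text.split(), 300, 50)`.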

In [None]:
all_names = defaultdict(set)
all_orgs = defaultdict(set)

# materialize the glob generator so that len() works for the progress bar
corpus = list(corpus)
for path in tqdm(corpus, total = len(corpus)):

    # each file in all_text is named <nodeID>.txt
    nodeID = path.stem
    text = path.read_text()

    # use the sliding window approach in Rust
    # to ensure that all of the sequences fit into the spacy RoBERTa model
    windows = _sliding_window(text.split(), 300, 50)
    
    for window in windows:
        doc = nlp(" ".join(window))
        
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                all_names[nodeID].add(ent.text)
            elif ent.label_ == "ORG":
                all_orgs[nodeID].add(ent.text)

Now create a fuzzy matching model based on edit distance.

In [None]:
# create a model to find the distances from one another
# this grows at O(n^2) where n is the total number of detected names in a corpus
model = EditDistance(n_jobs=-1, scorer=fuzz.WRatio)
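
For intuition about what an edit-distance similarity measures, here is a minimal pure-Python sketch of Levenshtein distance rescaled to a 0 to 1 score. It is not the `WRatio` scorer configured above, so its scores will differ from the model's; the function names are illustrative.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

Under this simple score, `"jon smith"` against `"john smith"` is one edit over ten characters, so a cutoff near 0.9 tolerates roughly one typo in a name of that length.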

First match based on direct match to the known aliases.

In [None]:

not_matched = []

for name in [j for i in all_names.values() for j in i]:
    cleaned_name = normalize_name(name)
    
    if cleaned_name in alias_counter:
        continue

    if cleaned_name in missed:
        continue
    
    not_matched.append(cleaned_name)



In [None]:
# total number of unique names
len(set([j for i in all_names.values() for j in i]))

In [None]:
# number of names not matched by the initial starting list or their aliases
len(set(not_matched))

We now match all of the unmatched names to the aliases of the starting known list using fuzzy matching

In [None]:
%%time

matches = model.match(not_matched,
                      [i for i in alias_counter.keys()] )

In [None]:
# find names that were matched to an alias with a match score higher than 0.92

# copy so that adding a column below does not write into a slice of `matches`
valid = matches.loc[matches.Similarity > .92].copy()

valid['matched_to_full_name'] = valid['To'].str.split().str.len()


# dictionary of those matched to the original starting list
matched_to_starting_list = valid.loc[(valid.matched_to_full_name > 1)].groupby('From').agg({"To": set})['To'].to_dict()

In [None]:
# number of additional names from the unmatched pool that got matched to an alias
len(matched_to_starting_list)

Now get new individuals from the remainder.

In [None]:
not_matched_round_2 = []
new_individuals = []
for name in not_matched:

    # from the unmatched pool earlier,
    # now consider the name matched
    # if the name is matched from the edit distance grouping
    if name in matched_to_starting_list:
        continue
    
    split_name = name.split()
    if len(split_name) > 1:

        # if any of the first or lastnames are just initials, 
        # move to additional matching round
        if len(split_name[0]) == 1 or len(split_name[-1]) == 1:
            not_matched_round_2.append(name)

        # if both of the first and lastnames are more than initials,
        # consider the name to be a new individual
        elif len(split_name[0]) > 2 and len(split_name[-1]) > 2:
            new_individuals.append(name)

        # move to additional matching round
        else:
            not_matched_round_2.append(name)
    
    # move to additional matching round
    else:
        not_matched_round_2.append(name)


In [None]:
# return True if the two strings have a WRatio score of at least 93 (on a 0-100 scale)
def match(s1, s2):
    return fuzz.WRatio(s1, s2) >= 93

In [None]:
# from all of the possible new names 
# group similar ones together by edit distance
grs = [] 
for i, name in tqdm(enumerate(set(new_individuals)), total = len(set(new_individuals))):
    for j, g in enumerate(grs):
        if all(match(name, w) for w in g):
            g.append(name)
            break
    else:
        grs.append([name, ])

In [None]:

# from the similar groups
# take the longest string to be the key and rest to be aliases
# this is to ensure that we do not lose information

new_individual_list = {}
for i in grs:
    max_name = max(i, key = len)
    for j in i:
        if j not in new_individual_list:
            new_individual_list[j] = set([max_name])
        else:
            new_individual_list[j].add(max_name)

Now match the unmatched names by the generated aliases of the new individuals.

In [None]:
aliases_new = {}
for group in grs:
    ID_name = max(group, key = len)
    for name in group:
        normalized = normalize_name(name)
        if normalized != "":
            if normalized not in aliases_new:
                aliases_new[normalized] = set([ID_name])
            else:
                aliases_new[normalized].add(ID_name)
        for alias in provide_aliases(normalized):
            if alias not in aliases_new:
                aliases_new[alias] = set([ID_name])
            else:
                aliases_new[alias].add(ID_name)

Now that we've created new aliases for new individuals, let's go through the whole process and disambiguate names.

In [None]:
total_names = []

names_with_initial_kb_and_aliases_matched = []

names_final_assigned = []


IDs_matched = []
nodeIDs = []

for nodeID, list_of_names in tqdm(all_names.items()):
    for name in list_of_names:

        # normalize the name
        cleaned_name = normalize_name(name)
        total_names.append(cleaned_name)

        # if the name was manually disambiguated
        if cleaned_name in missed:
            IDs_matched.append((cleaned_name, normalize_name(missed[cleaned_name]), nodeID))
            names_with_initial_kb_and_aliases_matched.append(normalize_name(missed[cleaned_name]))
            names_final_assigned.append(normalize_name(missed[cleaned_name]))
            continue

        # if the name is part of the aliases of the starting list of individuals
        if cleaned_name in alias_counter:
            if len(list(alias_counter[cleaned_name])) == 1:
                IDs_matched.append((cleaned_name, list(alias_counter[cleaned_name])[0], nodeID))
                names_with_initial_kb_and_aliases_matched.append(list(alias_counter[cleaned_name])[0])
                names_final_assigned.append(list(alias_counter[cleaned_name])[0])
            else:
                IDs_matched.append((cleaned_name, "##PERSON##", nodeID))
                names_with_initial_kb_and_aliases_matched.append(cleaned_name)
                names_final_assigned.append(cleaned_name)
            continue

        # if the name was fuzzy matched to aliases of the starting list of individuals
        if cleaned_name in matched_to_starting_list:

            if len(list(alias_counter[list(matched_to_starting_list[cleaned_name])[0]])) == 1:
                IDs_matched.append((cleaned_name, list(alias_counter[list(matched_to_starting_list[cleaned_name])[0]])[0], nodeID))
                names_final_assigned.append(list(alias_counter[list(matched_to_starting_list[cleaned_name])[0]])[0])
            else:
                IDs_matched.append((cleaned_name, "##PERSON##", nodeID))
                names_final_assigned.append(cleaned_name)
            continue

        # if the name is part of the aliases of the new list of individuals 
        if cleaned_name in aliases_new:
            if len(list(aliases_new[cleaned_name])) == 1:
                IDs_matched.append((cleaned_name, list(aliases_new[cleaned_name])[0], nodeID))
                names_final_assigned.append(list(aliases_new[cleaned_name])[0])
            else:
                IDs_matched.append((cleaned_name, "##PERSON##", nodeID))
                names_final_assigned.append(cleaned_name)
            names_with_initial_kb_and_aliases_matched.append(cleaned_name)
            continue

        split_name = cleaned_name.split()

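        # if still unmatched, decide by the shape of the name: multi-word names whose
        # first or last token is a single letter (or very short) keep the generic tag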
        if len(split_name) > 1:
            if len(split_name[0]) == 1 or len(split_name[-1]) == 1:
                IDs_matched.append((cleaned_name, "##PERSON##", nodeID))
                names_final_assigned.append(cleaned_name)
            elif len(split_name[0]) > 2 and len(split_name[-1]) > 2:
                IDs_matched.append((cleaned_name, list(new_individual_list[cleaned_name])[0], nodeID))
                names_final_assigned.append(list(new_individual_list[cleaned_name])[0])
            else:
                IDs_matched.append((cleaned_name, "##PERSON##", nodeID))
                names_final_assigned.append(cleaned_name)
        else:
            if cleaned_name in new_individual_list:
                IDs_matched.append((cleaned_name, list(new_individual_list[cleaned_name])[0], nodeID))
                names_final_assigned.append(list(new_individual_list[cleaned_name])[0])
            else:
                IDs_matched.append((cleaned_name, "##PERSON##", nodeID))
                names_final_assigned.append(cleaned_name)
        names_with_initial_kb_and_aliases_matched.append(cleaned_name)

You can inspect a sample of the matches to check whether the disambiguation worked well. If it did not, consider changing the score cutoff used for fuzzy matching.

In [None]:
results = pd.DataFrame(IDs_matched, columns = ["input", "matched", "nodeID"])

# look at up to 50 random matches for a common surname
matches = results.loc[results.input.str.contains("smith")]
matches.sample(min(50, len(matches)))
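
Beyond eyeballing a sample, it can help to summarize how many mentions fell back to the generic `##PERSON##` placeholder and which input spellings were merged under each matched name. The sketch below is a minimal illustration on made-up tuples shaped like `IDs_matched`; `toy_matches` and `summarize_matches` are hypothetical names and not part of the pipeline.

```python
# Toy stand-in for IDs_matched: (input name, matched name or placeholder, nodeID)
toy_matches = [
    ("j smith", "##PERSON##", 1),
    ("john smith", "john smith", 1),
    ("smith john", "john smith", 2),
    ("jane doe", "jane doe", 3),
]

def summarize_matches(matches, placeholder="##PERSON##"):
    """Return the share of unresolved mentions and the inputs merged under each matched name."""
    unresolved = sum(1 for _, matched, _ in matches if matched == placeholder)
    variants = {}
    for name, matched, _ in matches:
        if matched != placeholder:
            variants.setdefault(matched, set()).add(name)
    share = unresolved / len(matches) if matches else 0.0
    return share, variants

share, variants = summarize_matches(toy_matches)
print(f"unresolved share: {share:.2f}")
for matched, inputs in sorted(variants.items()):
    print(matched, "<-", sorted(inputs))
```

A high unresolved share, or unrelated spellings merged under one name, would both be reasons to revisit the cutoff.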

You can save the identifier system for downstream analyses.

In [None]:
pd.DataFrame(IDs_matched, columns = ["input", "matched", "nodeID"]).to_parquet("../../pii_detection/knowledge_base/matched_identifiers_240220.parquet")

In [None]:
with open("knowledge_base/new_individuals.txt", "w") as f:
    for i in set(new_individual_list.values()):
        f.write(i + "\n")
        
aliases_starting_json = {k:list(v) for k, v in aliases_starting.items()}
with open("knowledge_base/starting_individuals_aliases.json", "w") as f:
    json.dump(aliases_starting_json, f)



aliases_new_json = {k:list(v) for k, v in aliases_new.items()}
with open("knowledge_base/new_individuals_aliases.json", "w") as f:
    json.dump(aliases_new_json, f)
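
The alias tables are converted to lists before saving because Python sets are not JSON serializable. The following sketch, which assumes a toy table in the same alias-to-names shape (`toy_aliases` is a hypothetical name), shows the round trip and how an alias is treated as unambiguous only when it points to exactly one individual.

```python
import json

# Hypothetical alias table: alias -> set of candidate individuals
toy_aliases = {"j smith": {"john smith", "jane smith"}, "jdoe": {"jane doe"}}

# Sets cannot be dumped to JSON directly, so store each one as a sorted list
serialized = json.dumps({k: sorted(v) for k, v in toy_aliases.items()})

# Convert the lists back to sets after loading
restored = {k: set(v) for k, v in json.loads(serialized).items()}

# Keep only the aliases that resolve to a single individual
unambiguous = {k: next(iter(v)) for k, v in restored.items() if len(v) == 1}
print(unambiguous)
```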

You can follow the same procedure for organizations.

We can now replace names of individuals by their disambiguated identifiers.

In [None]:
from string import punctuation
import re
import json
from unidecode import unidecode

whitespace_regex = re.compile(r"\s+", re.MULTILINE)
email_regex = re.compile(r'[%\w.+-—:]+@[\w-]+\.[\w.-]+')
parentheses_regex = re.compile(r'\[(?:[^\]]*)\]|\((?:[^)]*)\)')

prefixes_suffixes = ["mr.", "mr", "mrs", "mrs.", "dr", "dr.", "phd", "ph.d", "ms", "ms.", "mister", "miss", "doctor", "jr.", "jr", "frs"]

prefix_suffix_pattern = r'\b(?:' + "|".join(map(re.escape, prefixes_suffixes)) + r')\b'

class KnowledgeBase:
    """
    A class representing a knowledge base.

    Attributes:
    - knowledge_dict (dict): A dictionary to store knowledge entries.
    """

    def __init__(self):
        """
        Initialize a new KnowledgeBase object.
        """
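        self.knowledge_dict = {}
        total = 0
        # starting individuals receive identifiers 1..N; only the text before the first ';' is used
        with open("knowledge_base/starting_individuals.txt", "r",  encoding="utf-8") as f:
            for i, line in enumerate(f, start=1):
                self.knowledge_dict[normalize_name(line.strip().split(';')[0].strip())] = i
                total = i

        # new individuals continue the numbering after the starting list
        with open("knowledge_base/new_individuals.txt", "r",  encoding="utf-8") as f:
            for i, line in enumerate(f, start=total + 1):
                cleaned_name = normalize_name(line.strip())
                if cleaned_name not in self.knowledge_dict:
                    self.knowledge_dict[cleaned_name] = i
                else:
                    # already present in the starting list; report the duplicate
                    print(line.strip())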

        with open("knowledge_base/starting_individuals_aliases.json", "r") as f:
            alias_counter = json.load(f)
        with open("knowledge_base/new_individuals_aliases.json", "r") as f:
            alias_counter_new_kb = json.load(f)

        self.alias_counter = alias_counter
        self.alias_counter_new_kb = alias_counter_new_kb
        
    def get_entry(self, name):
        """
        Retrieve an entry from the knowledge base.

        Args:
        - name: potential name to find in the knowledge base
        Returns:
        - The resolved name and its identifier, or the cleaned name and an empty list if the name is unknown or ambiguous.
        """

        cleaned_name = normalize_name(name)

        # exact match on a known individual
        if cleaned_name in self.knowledge_dict:
            return cleaned_name, self.knowledge_dict[cleaned_name]
        elif cleaned_name in self.alias_counter:

            if len(self.alias_counter[cleaned_name]) == 1:
                return self.alias_counter[cleaned_name][0], self.knowledge_dict[normalize_name(self.alias_counter[cleaned_name][0])]
            else:
                return cleaned_name, []
        elif cleaned_name in self.alias_counter_new_kb:
            if len(self.alias_counter_new_kb[cleaned_name]) == 1:
                return self.alias_counter_new_kb[cleaned_name][0], self.knowledge_dict[normalize_name(self.alias_counter_new_kb[cleaned_name][0])]
            else:
                return cleaned_name, []
        else:
            return cleaned_name, []

    
def normalize_name(name):
    
    if len(name) > 40:
        return name
    
    cleaned_name = name.replace("\n", " ")
    # remove all prefix and suffix
    cleaned_name = re.sub(prefix_suffix_pattern, '', unidecode(cleaned_name.lower()))
    
    
    # strip leading and trailing non-alphabetic characters (commas, periods, spaces)
    cleaned_name = re.sub(r'^[^a-zA-Z]+|[^a-zA-Z]+$', '', cleaned_name)
    
    # remove everything that is not a letter, whitespace, or a comma
    
    cleaned_name = re.sub(r"[^a-zA-Z\s,]" ,'', cleaned_name)
    if "," in cleaned_name:
        
        # at most one comma is expected, as in "lastname, firstname"
        if len(cleaned_name.split(",")) > 2:
            return name
        else:
            first = cleaned_name.split(",")[0]
            last = cleaned_name.split(",")[-1]

            if len(first) == 1 or len(last) == 1:
                return name
            else:
                return last + " " + first
    
    return cleaned_name


In [None]:
def replace_substring(s, replacement, position, length_of_replaced):
    """Replace `length_of_replaced` characters of `s`, starting at `position`, with `replacement`."""
    return s[:position] + replacement + s[position + length_of_replaced:]

def find_candidate(string):
    """Look up a detected name in the knowledge base.

    Returns the matched name and a list holding its identifier; the list is
    empty when the name is unknown or ambiguous.
    """
    # collapse line breaks and repeated spaces; normalize_name (called inside
    # get_entry) lowercases the name and reorders inverted "last, first" names
    string = whitespace_regex.sub(" ", string).strip()
    name, ID = kb.get_entry(string)

    # get_entry returns a bare identifier on a match and an empty list otherwise
    if not isinstance(ID, list):
        ID = [ID]
    return name, ID

def pre_criteria(string):
    # skip single-character entities, which are too short to be names
    return len(string) >= 2

# fallback regex patterns for phone-style numbers and emails that the model may have missed
idnum_pattern = re.compile(r"(?:\+?(\d{1,3}))?[-.(]*(\d{3})[-. )]*(\d{3})[-. ]*(\d{4})(?: *x(\d+))?")
email_pattern = re.compile(r"([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})", re.IGNORECASE)

def redact(text):
    
    doc = nlp(text)
    s = text
    
    for ent in reversed(doc.ents):
        if ent.label_ == 'PERSON':
            if pre_criteria(ent.text):
                _, ID = find_candidate(ent.text)

                if len(ID) > 0:
                    
                    replacement_string = "##PERSON" + "-" + str(ID[0]) + "##"

                else:

                    replacement_string = "##PERSON##"
                
                length_of_replaced = ent.end_char - ent.start_char
                s = replace_substring(s, replacement_string, ent.start_char, length_of_replaced)
        elif ent.label_ == "IDNUM" or ent.label_ == 'LOC' or ent.label_ == 'EMAIL':

            
            replacement_string = "##" + ent.label_ + "##"
            
            length_of_replaced = ent.end_char - ent.start_char
            s = replace_substring(s, replacement_string, ent.start_char, length_of_replaced)
    
    for m in reversed(list(idnum_pattern.finditer(s))):
        length_of_replaced = m.end() - m.start()
        s = replace_substring(s, '##IDNUM##', m.start(), length_of_replaced)

    for m in reversed(list(email_pattern.finditer(s))):
        length_of_replaced = m.end() - m.start()
        s = replace_substring(s, '##EMAIL##', m.start(), length_of_replaced)
    del text, doc
    return s
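
The `redact` function walks the entities in reverse because replacing a span changes the length of the string: working from the end keeps the character offsets of the earlier spans valid. A minimal, self-contained sketch of that idea follows; `replace_spans` and the hard-coded offsets are hypothetical stand-ins for the entity spans a model would return.

```python
def replace_spans(text, spans):
    """Replace (start, end, replacement) spans in text, working right to left."""
    for start, end, replacement in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text

text = "Dear John Smith, please call Jane Doe."
# Character offsets computed against the original, unmodified text
spans = [(5, 15, "##PERSON-1##"), (29, 37, "##PERSON##")]
print(replace_spans(text, spans))
# prints: Dear ##PERSON-1##, please call ##PERSON##.
```

Applying the same spans left to right would shift every later offset by the difference in length between the name and its tag.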
    

In [None]:
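from pathlib import Path
from tqdm import tqdm

kb = KnowledgeBase()

# make sure the output folder exists before writing
Path("redacted_text").mkdir(exist_ok=True)

for path in tqdm(Path("all_text/").glob("*.txt")):
    with open(path, 'r', encoding="utf-8") as f:
        data = f.read()
    redacted = redact(data)
    
    with open(f"redacted_text/{path.stem}.txt", "w", encoding="utf-8") as f:
        f.write(redacted)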

Now the text is all redacted and ready for analyses!
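
As a quick sanity check before analysis, one option is to count the placeholder tags in a redacted file and see how many person mentions received an identifier. This is a sketch that assumes the tag formats produced above (`##PERSON##`, `##PERSON-<id>##`, `##IDNUM##`, `##LOC##`, `##EMAIL##`); `tag_pattern` and `count_tags` are hypothetical names.

```python
import re
from collections import Counter

# Matches the generic tags and the PERSON tag that carries an identifier
tag_pattern = re.compile(r"##(PERSON|IDNUM|LOC|EMAIL)(?:-(\d+))?##")

def count_tags(text):
    """Count placeholder tags by label, separating resolved tags from generic ones."""
    counts = Counter()
    for label, identifier in tag_pattern.findall(text):
        key = f"{label} (resolved)" if identifier else label
        counts[key] += 1
    return counts

sample = "##PERSON-12## wrote to ##PERSON## at ##EMAIL## about ##LOC##. ##PERSON-12## replied."
print(count_tags(sample))
```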