
Welcome to chapter four of Methods in Medical Informatics! Healthcare data often contains private and sensitive information. An important part of guaranteeing data security is deidentifying data. Manual deidentification can be a laborious and slow process. This chapter will introduce a basic, computational text scrubber. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# 15.1 Text Scrubber for Deidentifying Confidential Text

Throughout history, people have tried to remove confidential, private, offensive,
and objectionable text from documents.
Human censors do an adequate job when the data flow is small, but the amount of
sensitive information created digitally is enormous. Large hospitals create
terabytes of information every week, and a good portion of that information comes in
the form of free text. Patient medical records are confidential. Those who want to use
this information for research purposes have two options: (1) obtain informed consent
from patients to use their records (an impossible task if you want to analyze data from
thousands of human subjects), or (2) deidentify the records by removing any information
that could link the contents of a medical record to an individual patient.
In the past several decades, a variety of programs have been written that attempt to
automatically remove confidential information from medical
records. These programs are sometimes called “scrubbers”, and most of these programs
use the following algorithm:

1. Prepare lists of patient names, hospital staff names, addresses, obscenities,
   objectionable hospital slang, and hospital identifier numbers.
2. Parse through the text, deleting or replacing entries from the list with noninformational
   characters.
3. Match the text against a series of regex patterns that might indicate the presence
   of identifying information (e.g., formalisms such as Mr., Dr., Mrs. followed
   by another word, or numeric values, or date components), and remove
   these strings.

These methods are the software equivalent of the human who reads through letters
and documents and marks over the objectionable parts. Parsing scripts that pass documents
through a long series of regex filters are always slow, and they never completely
remove objectionable material. They merely reduce the occurrences of objectionable
text, without eliminating the problem.
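
For contrast with the doublet method, the three-step list-and-regex algorithm described above might be sketched as follows. This is a minimal illustration, not any particular published scrubber: the `BLOCKLIST` entries, the two regex `PATTERNS`, and the function name `blocklist_scrub` are all hypothetical.

```python
import re

# Step 1 (hypothetical data): a real list would hold patient names, staff
# names, addresses, slang, and hospital identifier numbers.
BLOCKLIST = ["jane doe", "mercy hospital"]

# Step 3 (hypothetical patterns): a title followed by a word, and simple dates.
PATTERNS = [
    re.compile(r"\b(?:mr|mrs|dr)\.\s+\w+", re.IGNORECASE),
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
]

def blocklist_scrub(text, mask="*"):
    # Step 2: replace every blocklisted entry with the mask.
    for entry in BLOCKLIST:
        text = re.sub(re.escape(entry), mask, text, flags=re.IGNORECASE)
    # Step 3: replace anything that matches an identifying pattern.
    for pattern in PATTERNS:
        text = pattern.sub(mask, text)
    return text

print(blocklist_scrub("Dr. Smith saw Jane Doe on 3/14/2009 for a biopsy."))
# prints: * saw * on * for a biopsy.
```

Note that a name missing from the list and not preceded by a title, such as the bare "Smith" in "Smith returned today.", passes through this sketch unchanged, which is the kind of leak the paragraph above describes.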

There is a better way that is essentially the reverse of censorship. You create a list
of acceptable phrases, and you parse through the text, deleting everything that is not
included on your list. This method can parse text very quickly, because it has no regex
filters. The method is potentially perfect, because the only text that appears in the final
document is text composed of words and phrases that were preapproved.*

> This script uses the file [doublets.txt](./K11946_Files/DOUBLETS.TXT), a text file containing many medical term doublets. Additional information is available [here](https://wordpress-dept-chip.apps.cloudapps.unc.edu/datafiles/).

**Description adapted from pages 219-20 of "Methods in Medical Informatics"*

In [2]:
#!/usr/bin/env python

import re

# Load the approved doublets into a dictionary for fast membership tests.
doub_hash = {}
with open("./K11946_Files/DOUBLETS.TXT", "r") as doub_file:
    for line in doub_file:
        doub_hash[line.rstrip()] = " "

line = input("What would you like to scrub? ")
line = line.lower().strip()
linearray = re.split(r' +', line)

# lastword is what gets printed when the current doublet is not approved:
# the current word if the previous doublet was approved, otherwise an asterisk.
lastword = "*"
for i in range(0, len(linearray)):
    doublet = " ".join(linearray[i:i+2])
    # The final word has no successor, so it cannot start a doublet.
    if i + 1 < len(linearray) and doublet in doub_hash:
        print(" " + linearray[i], end=' ')
        lastword = " " + linearray[i+1]
    else:
        print(lastword, end=' ')
        lastword = " *"
print()


What would you like to scrub? doublets.txt
* 

## Script Algorithm: Text Scrubber for Deidentifying Confidential Text

In Chapter 9, Section 9.2, we created a list of word doublets from a PubMed
corpus, consisting of titles of research papers written on the subject of cancer genes. For this chapter, we created a similar doublet list. Open the file.*

In [3]:
import sys, re, string
doub_file = open("./K11946_Files/DOUBLETS.TXT", "r")

Load the approved doublets into a dictionary, then prompt the user to enter a sentence. The user may feel
free to enter a sentence that is offensive, incriminating, filled with the names
of people, or full of sensitive information. The entered text is parsed doublet by doublet, where each doublet
consists of a word in the text followed by the next consecutive word.
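
To make the doublet parsing concrete, here is a small sketch of how a sentence breaks into overlapping doublets. The helper name `doublets` is hypothetical and is not part of the chapter's script; it splits on any whitespace, whereas the script splits on runs of spaces.

```python
def doublets(text):
    # Lowercase and split into words, much as the scrubber script does.
    words = text.lower().split()
    # Pair each word with the word that follows it.
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(doublets("The placenta shows a real affinity"))
# prints: ['the placenta', 'placenta shows', 'shows a', 'a real', 'real affinity']
```

A sentence of n words yields n - 1 doublets, and every word except the first and last appears in two of them. That overlap is why a word can survive scrubbing if either of its doublets is approved.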

In [11]:
doub_hash = {}
for line in doub_file:
    line = line.rstrip()
    doub_hash[line] = " "
doub_file.close()

line = input("What would you like to scrub? ")
line = line.lower()
line = line.rstrip()
linearray = re.split(r' +', line)

lastword = "*"

What would you like to scrub? Kevin Shang


Comparisons are made against the list of preapproved doublets (doublets.txt
in this case). Word doublets in the text that match word doublets on the list are saved.
Everything else is replaced by an asterisk.

In [4]:
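# Walk the words; a word is kept only if it belongs to an approved doublet.
for i in range(0, len(linearray)):
    doublet = " ".join(linearray[i:i+2])
    # The final word has no successor, so it cannot start a doublet.
    if i + 1 < len(linearray) and doublet in doub_hash:
        print(" " + linearray[i], end=' ')
        lastword = " " + linearray[i+1]
    else:
        # Print the pending word (or asterisk) carried over from the previous step.
        print(lastword, end=' ')
        lastword = " *"
print()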

 * 
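
The same logic can be wrapped in a function, which makes it easier to try on small inputs. This is a sketch under stated assumptions: the function name `scrub` and the three-entry `approved` set are hypothetical stand-ins for the script's loop and the full doublet list.

```python
def scrub(text, approved):
    words = text.lower().split()
    out = []
    keep_next = False  # True when the previous doublet was approved
    for i, word in enumerate(words):
        # The last word has no successor, so it cannot start a doublet.
        doublet_ok = i + 1 < len(words) and f"{word} {words[i + 1]}" in approved
        # A word survives if it belongs to at least one approved doublet.
        out.append(word if (doublet_ok or keep_next) else "*")
        keep_next = doublet_ok
    return " ".join(out)

approved = {"the placenta", "placenta shows", "fetal liver"}  # hypothetical list
print(scrub("The placenta shows Kevin Shang the fetal liver", approved))
# prints: the placenta shows * * * fetal liver
```

The second "the" is blocked even though "the" appears in an approved doublet elsewhere: approval applies to word pairs in context, not to individual words.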

**This section is adapted from section 15.1.1, "Script Algorithm", of page 220 from "Methods in Medical Informatics".*

## Analysis: Text Scrubber for Deidentifying Confidential Text

The doublet method script, with minor modifications, can scrub text of any length. To illustrate, I downloaded a public domain book from Project Gutenberg: Anomalies and Curiosities of Medicine by George M. Gould and Walter Lytle Pyle. This book has lots of medical terminology and vaguely resembles the kind of text that might be included in a pathology report. Anyone can download the same text from
http://www.gutenberg.org/etext/747
An example output paragraph is shown below. As expected with the doublet
method, there are many blocked words. This is a limitation of the doublet method: if you use the standard list of doublets on any random book, you are bound to block some innocent doublets that were not included in the “approved” list. The only way around this limitation is to add safe doublets (from the text) to the approved
list.
<br>
<br><p>
In this important *, *, * * some historical *, describes a long series of experiments performed
on * in order to * the passage of *, *, *, *, *, *, * * the placenta. The placenta shows a real affinity
for * substances; in it * copper and mercury, but *, and it is therefore * it that the * * *;
in addition to its *, intestinal, and *, * * glycogen and acts as an * *, and so resembles in its
action the liver; * * of the fetus * only a potential *. * up of * in the placenta is not so general
as * of them in the liver of the mother. It may be * the placenta does not form a barrier
to the passage of * the circulation of the fetus; this would seem to * * *, which was always
found in the * never in the fetal organs. In * * lead and * accumulation of the * in the fetal
tissues is * in the maternal, perhaps from differences in * * or from greater diffusion. * it is * * barrier to the passage of *, * * * * degree of obstruction: it allows copper and * * *, * with
greater difficulty. The * toxic substances in the fetus does not follow the same * * the adult.
They * more widely in the fetus. In the * liver is the chief * *. *, which in * * to accumulate
in the liver, is in the fetus * in the skin; copper accumulates in the fetal liver, * system, and
sometimes in the skin; * which is * in the maternal liver, but also in the skin, has * in the
skin, liver, * centers, and elsewhere * *. The frequent presence of * in the fetal * its physiologic
importance. It has probably not * * influence on its *. On the * in the placenta and
nerve * * * * abortion and the birth of dead *) Copper and lead did not cause *, * * so in two
out of six *. Arsenic is a * agent in the *, * * * * *. An important * is that * * is frequently and
seriously affected in syphilis, * * the special * for the accumulation of *. * * * * * action in
this disease? The * of lead in the central nervous system of the * the frequency and serious
character of * lesions. The presence of * in the * * * an explanation of the therapeutic results
of * of this substance in skin *.</p>
<br>
<br><p>
The strengths of the doublet method are accuracy and speed. The book was deidentified in seconds.</p>
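
One way to act on the limitation noted above is to tally the doublets that were blocked, so that a human reviewer can decide which ones are safe to add to the approved list. The sketch below is an assumption about how that review step might look, not something taken from the book; `blocked_doublets`, the `approved` set, and the `sample` sentence are hypothetical.

```python
from collections import Counter

def blocked_doublets(text, approved):
    # Count every doublet in the text that is missing from the approved list.
    words = text.lower().split()
    counts = Counter()
    for first, second in zip(words, words[1:]):
        doublet = f"{first} {second}"
        if doublet not in approved:
            counts[doublet] += 1
    return counts

approved = {"the placenta", "fetal liver"}  # hypothetical list
sample = "the placenta stores copper and the fetal liver stores copper"

# Show the most frequently blocked doublets first.
for doublet, n in blocked_doublets(sample, approved).most_common(3):
    print(n, doublet)
```

Frequently blocked doublets are good candidates for review, but each one still needs a human decision: adding doublets to the list without checking them could let names or other identifiers through.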

**This section is adapted from section 15.1.2, "Analysis", of pages 222-223 in "Methods in Medical Informatics".*