In [None]:
!pip install regex

Regular Expressions (RegEx) are a formal grammar for defining patterns in text. Once you define a RegEx pattern and compile it, you can apply it to a document to check if that pattern exists and if so, extract all or specific parts of it.

This makes it really useful for extracting information from a document that is either highly structured itself (such as a radiology report produced with a standard template), or where the target text has a predictable format regardless of the overall format of the document, such as finding the Gleason score in a pathology report.

---

First we import the regex Python library, along with pprint (pretty print), which formats printed output so it is easier to read.

In [None]:
import regex
import pprint
pp = pprint.PrettyPrinter(indent=2)

Email addresses are a good example of text that has a predictable format regardless of the overall format of the parent document. Here's a RegEx that will match most email addresses in everyday use:

In [None]:
email_regex = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b"

Breaking that down, we have:

***r"..."***: The r prefix on a string tells Python to treat it as raw characters and not try to interpret any backslash sequences. Python has its own interpretation of "\b" (a backspace character), and in this case we want RegEx to make that determination, not Python.
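A quick way to see the difference is to compare the two kinds of string directly. This is a minimal sketch that relies only on built-in string behavior; the variable names are just for illustration:

```python
# Without the r prefix, Python turns "\b" into a single backspace character.
plain = "\b"
# With the r prefix, the backslash and the "b" are kept as two characters.
raw = r"\b"

print(len(plain))        # 1
print(len(raw))          # 2
print(plain == "\x08")   # True: "\x08" is the backspace character
```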

***\b***: This is a RegEx *token* that indicates we want to limit our matches to text that occurs at *word boundaries*, so not in the middle of a word. A word boundary is the position between a word character (a letter, digit, or underscore) and anything else, such as the start of a line, a space, or a punctuation mark.
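As a small illustration of word boundaries, the sketch below uses a made-up pattern and the standard library `re` module (assumed here in place of `regex`; it treats `\b` the same way):

```python
import re  # standard library; stands in for the regex package in this sketch

# \bcat\b only matches "cat" when it stands alone as a word.
cat_regex = re.compile(r"\bcat\b")

print(cat_regex.findall("cat, concat, cat."))  # ['cat', 'cat']
print(cat_regex.findall("concatenate"))        # []
```

The "cat" inside "concat" and "concatenate" is skipped because it is preceded by another letter, so there is no boundary in front of it.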

***[A-Z0-9._%+-]***: The square brackets indicate that we are defining a *character class*, and the contents of the square brackets are the characters that we want to match. Ranges are written with a hyphen, so *A-Z* means we'll accept any character from A to Z, and similarly for *0-9*. We will also accept the literal characters ".", "_", "%", "+", and "-" (a hyphen placed last in the class is literal rather than part of a range), which are all valid characters for the prefix of an email address.

***+***: The plus sign after the character class means we require one or more characters from the class to match the target for it to be considered valid.

So far, our RegEx would match the prefix of an email address, so "twloehfelm", "thomas.loehfelm", and "t_w_loehfelm2", but not the whole thing.
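To check that claim, here is a minimal sketch that compiles only the first part of the pattern. It uses the standard library `re` module in place of `regex`, and `prefix_regex` is a name introduced just for this illustration:

```python
import re  # standard library; stands in for the regex package in this sketch

# Only the first two pieces of the email pattern: word boundary + prefix class.
prefix_regex = re.compile(r"\b[A-Z0-9._%+-]+", re.IGNORECASE)

# Each of these prefixes is matched in full.
for prefix in ["twloehfelm", "thomas.loehfelm", "t_w_loehfelm2"]:
    print(prefix, prefix_regex.fullmatch(prefix) is not None)  # True each time

# On a whole address, the match stops at "@" because "@" is not in the class.
print(prefix_regex.search("twloehfelm@ucdavis.edu").group())  # twloehfelm
```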

***@***: The "@" outside of a character class means we require that exact character to occur. All email addresses follow the pattern "prefix@domain", so "@" will always be found in a valid email address.

***[A-Z0-9.-]+***: Similar to the prefix, we'll accept any of these characters and require at least one of them (notice the *+* at the end).

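***\\.***: An unescaped period in RegEx matches any character, so the backslash tells RegEx we mean a literal period. We require there to be a single period preceding the final component of the domain.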

***[A-Z]{2,}***: This is the final part of the domain (e.g. "com", "org", "io"), and is required to be two or more letters.
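The same breakdown can be written into the pattern itself using verbose mode, which ignores whitespace and allows comments inside the pattern. This is a sketch using the standard library `re` module (assumed here in place of `regex`), and `email_verbose` is a name introduced only for this example:

```python
import re  # standard library; stands in for the regex package in this sketch

email_verbose = re.compile(r"""
    \b                 # start at a word boundary
    [A-Z0-9._%+-]+     # prefix: one or more allowed characters
    @                  # the literal "@" separator
    [A-Z0-9.-]+        # domain name: one or more allowed characters
    \.                 # a literal period
    [A-Z]{2,}          # final part of the domain: two or more letters
    \b                 # end at a word boundary
""", re.VERBOSE | re.IGNORECASE)

print(email_verbose.findall("Contact thomas.loehfelm@panorad.io today."))
# ['thomas.loehfelm@panorad.io']
```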

In [None]:
email_validator = regex.compile(email_regex, regex.IGNORECASE)

In [None]:
text = """
If you have any questions please email me at twloehfelm@ucdavis.edu, 
twloehfelm@gmail.com, or thomas.loehfelm@panorad.io and I'll respond 
as soon as I can."""


Once we've compiled a RegEx pattern in Python, we can search a document to find all of the matching text:

In [None]:
email_validator.findall(text)
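If the position of each match matters as well, `finditer` yields match objects instead of plain strings. A minimal sketch, using the standard library `re` module in place of `regex`; `validator` and `sample_text` are hypothetical names chosen so they don't overwrite the variables above:

```python
import re  # standard library; stands in for the regex package in this sketch

validator = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)
sample_text = "Email me at twloehfelm@ucdavis.edu or thomas.loehfelm@panorad.io."

# Each match object carries the matched text and where it was found.
found = []
for match in validator.finditer(sample_text):
    found.append((match.group(), match.start(), match.end()))
    print(match.group(), match.start(), match.end())
```

Slicing `sample_text[start:end]` with the stored positions returns the matched address again, which is handy when highlighting matches in the original document.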

---
Here's a function that will extract TI-RADS assignments from our Thyroid ultrasound reports.

Note that we are defining a dictionary object of {key: value} pairs where the key is an integer from 1-5, and the value is a RegEx pattern we are searching for. In this case, we are looking for text that exactly matches the template text, so the RegEx pattern is literally just the text we want to find.

Note that we first replace newline characters ("\n") with blank spaces (" ") because Powerscribe automatically inserts newlines to limit the line length of our reports. The TI-RADS assignment then could be "TR4 - Moderately\nsuspicious." in some cases. Replacing all newlines with spaces fixes that issue.
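The effect of the line wrap is easy to reproduce on a short fragment. This is a minimal sketch using the standard library `re` module in place of `regex`; `fragment` and `template` are names made up for the example:

```python
import re  # standard library; stands in for the regex package in this sketch

fragment = "TR4 - Moderately\nsuspicious."
template = "TR4 - Moderately suspicious."

# The wrapped text does not contain the template, because of the newline.
print(re.search(re.escape(template), fragment) is None)  # True

# After replacing newlines with spaces, the template is found.
unwrapped = fragment.replace("\n", " ")
print(re.search(re.escape(template), unwrapped) is not None)  # True
```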

We are storing each TI-RADS mention in a list, keeping track of the highest TI-RADS score in the entire document, and then saving all of that output into a dictionary object of {key: value} pairs where the keys are "max" and "all", the value of "max" is an integer from 0-5 (0 means no TI-RADS mention was found), and the value of "all" is the list of mentions.

In [None]:
def extract_tirads(report):
    """
    Uses RegEx matching to find TIRADS classifications in a report.
    """
    tirads_mentions = []
    max_tirads = 0
    tr_scores = {
        1: "TR1 - Benign.",
        2: "TR2 - Not suspicious.",
        3: "TR3 - Mildly suspicious.",
        4: "TR4 - Moderately suspicious.",
        5: "TR5 - Highly suspicious."
    }
    # Powerscribe wraps long lines, so join them before matching.
    report = report.replace("\n", " ")
    for tr_score, template in tr_scores.items():
        # Escape the template so "." matches a literal period, not any character.
        for match in regex.finditer(regex.escape(template), report):
            max_tirads = max(max_tirads, tr_score)
            mention = {"tirads": tr_score, "first_pos": match.start(), "last_pos": match.end()}
            tirads_mentions.append(mention)
    
    return {"max": max_tirads, "all": tirads_mentions}

In [None]:
report = """
US THYROID / THYROIDECTOMY\nCOMPARISON: None\n\nINDICATION: Signs/symptoms: Neck swelling  Suspected dx/hx: Other, specify:\nunknown  Comments:\n\nTECHNIQUE: Multiple transverse and sagittal grayscale and color Doppler\nsonographic images of the thyroid gland were obtained.\n\nFINDINGS:\nEvaluation limited by inferior position of the thyroid.\n\nRight lobe: 3.7 x 2.0 x 1.6 cm (5.6 cc), heterogeneous parenchyma.\nNodules:\n1. Mid lobe, 1.4 x 1.0 x 1.2 cm, solid or almost completely solid (2\npts), isoechoic (1 pt), wider-than-tall (0 pt), with smooth margins (0 pt).\nMacrocalcifications: absent (0 pt). Peripheral rim calcification: absent (0\npt). Punctate echogenic foci: none (0 pt). TR3 - Mildly suspicious.\n2. Inferior pole, 2.0 x 1.7 x 1.6 cm, solid or almost completely solid\n(2 pts), hypoechoic (2 pts), wider-than-tall (0 pt), with smooth margins (0\npt). Macrocalcifications: absent (0 pt). Peripheral rim calcification:\nabsent (0 pt). Punctate echogenic foci: none (0 pt). TR4 - Moderately\nsuspicious.\n3. Additional subcentimeter nodule that does not meet criteria for FNA\nor follow-up.\n\nLeft lobe: 2.4 x 1.4 x 1.3 cm (2.1 cc), heterogeneous parenchyma.\nNodules:\n1. Inferior pole, 0.9 x 1.0 x 0.9 cm, solid or almost completely solid\n(2 pts), hypoechoic (2 pts), wider-than-tall (0 pt), with smooth margins (0\npt). Macrocalcifications: absent (0 pt). Peripheral rim calcification:\nabsent (0 pt). Punctate echogenic foci: none (0 pt). TR3 - Mildly\nsuspicious.\n2. Additional subcentimeter nodules do not meet criteria for FNA or\nfollow-up.\n\nIsthmus: 0.2 cm.\nNodules: none.\n\nNo abnormal cervical lymph nodes. \n\nRight submandibular gland: 3.5 x 1.2 x 3.4 cm. Normal echotexture.\nProminent salivary duct without obstructing stone.\nLeft submandibular gland: 3.0 x 1.6 x 3.3 cm. Normal echotexture.\n\nIMPRESSION:\n1. Mildly dilated right submandibular salivary duct. 
The remainder of\nthe gland is not impressive for an inflammatory process, but consider\nsialadenitis as a cause of the patient's right neck pain. No obstructing\nstones.\n2. Multiple bilateral thyroid nodules. Only the 2 cm right inferior\npole TR-4 nodule meets criteria for FNA or follow-up. Please see\nrecommendations below.\n3. Heterogenous thyroid parenchyma, suggestive of chronic thyroiditis.\n\n\nACR TI-RADS Consensus Recommendations:\nTR-5 - Highly suspicious: FNA if > 1 cm, follow if > 0.5 cm\nTR-4 - Moderately suspicious: FNA > 1.5 cm, follow if > 1 cm\nTR-3 - Mildly suspicious: FNA if > 2.5 cm, follow if > 1.5 cm\nTR-2 - Not suspicious: FNA is not recommended\nTR-1 - Benign: FNA is not recommended\n\nReference: ACR Thyroid Imaging, Reporting and Data System (TI-RADS): White\nPaper of the ACR TI-RADS Committee. Tessler et al. J Am Coll Radiol. 2017\nMay;14(5):587-595. PMID:28372962.\n
"""

In [None]:
tirads = extract_tirads(report)

In [None]:
pp.pprint(tirads)

That was a pretty simple RegEx pattern, but they can grow to be arbitrarily complex depending on your use case. Extracting Gleason scores turns out to be pretty complicated because Pathologists, just like Radiologists, don't always use the exact same format. Sometimes it is:
- Gleason 3+4
- Gleason 3 + 4
- Gleason's 3+4
- Gleason's score: 3+4
etc.
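To see how optional pieces in a pattern absorb that kind of variation, here is a much-simplified sketch that handles only the four variants listed above. It is not the pattern used in this notebook; `simple_gleason` is a hypothetical name, and the standard library `re` module is assumed in place of `regex`:

```python
import re  # standard library; stands in for the regex package in this sketch

simple_gleason = re.compile(
    r"gleason(?:'?s)?"      # "Gleason", optionally followed by "s" or "'s"
    r"(?:\s+score)?"        # optional word "score"
    r"\s*:?\s*"             # optional colon with optional spaces around it
    r"([1-5])\s*\+\s*([1-5])",  # the two patterns, e.g. "3+4" or "3 + 4"
    re.IGNORECASE,
)

variants = ["Gleason 3+4", "Gleason 3 + 4", "Gleason's 3+4", "Gleason's score: 3+4"]
for variant in variants:
    print(variant, "->", simple_gleason.search(variant).groups())  # ('3', '4') each time
```

Real reports contain many more variations than these four, which is why a production pattern ends up considerably longer.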

Here's my current version of a RegEx pattern that seems to match all of the various ways Gleason scores are mentioned in Pathology reports at UC Davis over the last few years:

In [None]:
# Raw string so Python leaves the backslashes for RegEx to interpret.
# (?ei) sets inline flags: i = ignore case, e = regex's ENHANCEMATCH flag.
gleason = r"(?ei)(?:gleason|adenocarcinoma|histologic|primary)(?:[\s's,])*(?:combined|(?:\(predominant\))*)*(?:\s)*(?:score|grade|pattern)*(?:[s\s\(\:])*(?:primary(?: pattern)*)*\s*(?:\:)*(?:grade)*(?:\:)*\s*(?:grade|pattern)*\s*([1-5])(?:[\+\-\.\s\]*)*(?:secondary)*\s*(?:\(worst remaining\))*(?:pattern|grade)*(?:[:\s])*(?:grade|pattern)*(?:[:\s])*([1-5])"

gleason_regex = regex.compile(gleason)

And here's some example text standing in for a Pathology report. We can apply our Gleason RegEx to it to find all of the mentions of Gleason score, store each one, and keep track of the worst Gleason score found in the whole text:

In [None]:
text = """
Gleason 3+4
Gleason 4 + 5
Gleason's 3+4
Gleason's score: 2 + 3
Gleasons score: 5+5
"""

In [None]:
max_gleason = 0
# Group 1 is the primary (major) pattern, group 2 the secondary (minor) pattern.
for match in gleason_regex.finditer(text):
    major = int(match.group(1))
    minor = int(match.group(2))
    total = major + minor
    max_gleason = max(max_gleason, total)
    # A total of 7 or more is reported here as clinically significant prostate cancer (csPCa).
    if total >= 7:
        print(f"   csPCa: {major}+{minor}")
    else:
        print(f"No csPCa: {major}+{minor}")
print(f"\nMax Gleason score: {max_gleason}")