In [2]:
%%javascript
$(document).ready(function() {
    var cells = Jupyter.notebook.get_cells();
    for(var i in cells) {
        var cell = cells[i];
        var tags = cell.metadata.tags;
        if (tags && tags.indexOf('hide-from-students') >= 0) {
            cell.element.hide();
            cell.execute()
        }
    }
});


  <div>
    <h1 align="center">Exercise 01 - Medical Information Retrieval 2023</h1>
  </div>
  <br />

## Regular Expressions <a class="anchor" id="first"></a>

Regular Expressions (or regex) are a powerful tool for searching strings based on a variety of rules. For instance, you can use them to locate all capital letters in a string or to identify all negative numbers in a document.
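
As a minimal sketch of the two searches just mentioned, the snippet below uses Python's built-in `re` module on a made-up sample string (the string and variable names are illustrative assumptions, not part of the exercise data):

```python
import re

# Made-up sample string, for illustration only
sample = "Patient A had readings of -3, 12 and -47 on Monday."

# All capital letters in the string
capitals = re.findall(r"[A-Z]", sample)
print(capitals)   # ['P', 'A', 'M']

# All negative integers: a minus sign followed by one or more digits
negatives = re.findall(r"-\d+", sample)
print(negatives)  # ['-3', '-47']
```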

Despite their flexibility, regular expressions are known for their peculiar syntax. This is because a regular expression must be able to describe almost any string pattern you can think of, which requires a compact but dense pattern format.

Python's built-in `re` module is used to handle regular expressions.

Let's begin with regex by learning how to search for basic patterns within a string.

### Simple string matching search <a class="anchor" id="first-1"></a>

One way to search for specific information within text data is by using simple string matching. This involves searching for a specific sequence of characters within a string, such as a particular word or phrase. For example, we might search for the word "heart" within a clinical patient note.

Import Packages

In [3]:
import re
import pandas as pd
import numpy as np

The following is a clinical patient note. Over the course of the exercise, we will try out different NLP techniques on a dataset of 40,000 such notes.

In [4]:
text = """
17-year-old male, has come to the student health clinic complaining of heart pounding. Mr. Cleveland's mother has given verbal consent for a history, physical examination, and treatment
-began 2-3 months ago,sudden,intermittent for 2 days(lasting 3-4 min),worsening,non-allev/aggrav
-associated with dispnea on exersion and rest,stressed out about school
-reports fe feels like his heart is jumping out of his chest
-ros:denies chest pain,dyaphoresis,wt loss,chills,fever,nausea,vomiting,pedal edeam, heart pounding strong
-pmh:non,meds :aderol (from a friend),nkda
"""

In the following, we will explore the functions `match()`, `search()`, `findall()` and `finditer()` from the `re` module.


A simple check of whether a string appears in the text:
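
Before applying these functions to the patient note, the following sketch contrasts them on a short made-up string (the sample string is an assumption for illustration). `match()` only succeeds if the pattern matches at the very beginning of the string, `search()` returns the first match anywhere, and `findall()` returns all non-overlapping matches:

```python
import re

# Made-up sample string, for illustration only
sample = "chest pain, then heart pounding; heart racing"

# match() anchors at the start of the string
at_start_heart = re.match("heart", sample)   # no match at position 0
at_start_chest = re.match("chest", sample)   # matches at position 0
print(at_start_heart)          # None
print(at_start_chest.group())  # chest

# search() scans the whole string and returns the first match
first = re.search("heart", sample)
print(first.span())            # (17, 22)

# findall() returns every non-overlapping match as a list of strings
all_hits = re.findall("heart", sample)
print(all_hits)                # ['heart', 'heart']
```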

In [5]:
string = "heart"

string in text

True

Position of the first occurrence of the string:

In [6]:
match = re.search(string, text)
match.span()

(72, 77)

Find not only the first occurrence (the string may appear more than once):

In [7]:
# How often does the string "heart" appear in the text?
num = len(re.findall("heart", text))

num

3

Now find all occurrences and their exact position spans:

In [8]:
iterator = re.finditer("heart", text)  

for match in iterator:
    print(match.span())

(72, 77)
(383, 388)
(502, 507)


### Pattern matching search <a class="anchor" id="first-2"></a>

Pattern matching allows for more complex and varied searches within text data. It involves the use of regular expressions, or regex, which are sequences of characters that define a search pattern. Regex can be used to search for specific patterns of characters within a string, such as numerical values expressed in different formats.

For example, we might search for ages within a string. An age could be expressed as an integer followed by a keyword such as "years" or "age".

Examples of simple patterns are listed in the following table:

<table>
<tr><th>Character</th><th>Description</th></tr>
<tr><td><span>\d</span></td><td>A digit</td></tr>
<tr><td><span>\w</span></td><td>A word character (alphanumeric or underscore)</td></tr>
<tr><td><span>\s</span></td><td>Whitespace</td></tr>
<tr><td><span>\D</span></td><td>A non-digit</td></tr>
<tr><td><span>\W</span></td><td>A non-word character</td></tr>
<tr><td><span>\S</span></td><td>Non-whitespace</td></tr>
</table>
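
To make the table concrete, here is a small sketch that applies each character class to a made-up string (the sample string is an assumption, not taken from the dataset). The `+` quantifier, meaning "one or more", is used in some patterns to group neighbouring characters:

```python
import re

# Made-up sample string, for illustration only
sample = "BP 120/80, pulse 72"

digits = re.findall(r"\d", sample)        # every single digit
words = re.findall(r"\w+", sample)        # runs of word characters
spaces = re.findall(r"\s", sample)        # every whitespace character
non_digits = re.findall(r"\D+", sample)   # runs of non-digits
non_words = re.findall(r"\W+", sample)    # runs of non-word characters
non_spaces = re.findall(r"\S+", sample)   # runs of non-whitespace

print(digits)      # ['1', '2', '0', '8', '0', '7', '2']
print(words)       # ['BP', '120', '80', 'pulse', '72']
print(spaces)      # [' ', ' ', ' ']
print(non_digits)  # ['BP ', '/', ', pulse ']
print(non_words)   # [' ', '/', ', ', ' ']
print(non_spaces)  # ['BP', '120/80,', 'pulse', '72']
```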

For example, if we want to find the first digit in the patient note:

In [9]:
match = re.search(r'\d', text)
if match is not None:
    print(match.span())

(1, 2)


Now for something a bit more complex: we are looking for a pattern that could indicate the age of the patient. To do so, we search for a positive number followed by one of the strings "year", "old", or "age" within the next 16 characters. Once we have found such a pattern, we want to extract the age from it.

In [10]:
result = re.findall(r"\d+.{0,16}?(?:year|old|age)", text)
print(result)

['17-year']


Let's break down this regular expression:

- `\d+`: Matches one or more digits for the positive number.
- `.{0,16}?`: Matches any character (except for a newline) between zero and 16 times, but as few times as possible to allow the rest of the regular expression to match.
- `(?:year|old|age)`: Non-capturing group that matches one of the specified strings: "year", "old", or "age".

So, this regular expression will match a positive number followed by the strings "year", "old", or "age" within 16 characters. For example, it would match "10 is his age", "7 decades old", or "34 years old".
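
The example strings above can be checked with a short sketch. As a variation that is not part of the original pattern, the digits are wrapped in a capturing group `(\d+)` here so that the number can be read directly from the match. The list `examples` and the extra non-matching string are assumptions for illustration:

```python
import re

# Same pattern as above, with the number in a capturing group (an added variation)
pattern = re.compile(r"(\d+).{0,16}?(?:year|old|age)")

# The three examples from the text plus one string that should not match
examples = ["10 is his age", "7 decades old", "34 years old", "3 mg daily"]

matched = []
ages = []
for example in examples:
    m = pattern.search(example)
    matched.append(m.group(0) if m else None)  # the full matched snippet
    ages.append(int(m.group(1)) if m else None)  # only the captured number

print(matched)  # ['10 is his age', '7 decades old', '34 year', None]
print(ages)     # [10, 7, 34, None]
```

Note that for "34 years old" the match ends at "year", because the lazy quantifier stops as soon as one of the keywords is found.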

If something was found, extract the number from it:

In [11]:
if len(result) > 0:
    # Take the leading number from the first matched snippet, not from the whole text
    age = int(re.search(r"\d+", result[0]).group())
    print(age)

17


### Exercise <a class="anchor" id="first-3"></a>

You have been provided with a dataset of clinical notes which contains information about patients' health conditions. Your task is to use regular expressions to determine, for each note, whether it mentions that the patient uses caffeine (drinking coffee, taking caffeine pills, etc.).

* Use regular expressions to detect mentions of the patient's caffeine use in each note. This information can be expressed in different ways (e.g. "drinks coffee", "cafeine use", "hyper active" etc.), so you will need to create a more complex regular expression that can capture all these variations.

* Create a function that takes a clinical note as input and uses your regular expression to return a binary value indicating whether the note mentions caffeine use.

Let's load the notes from a CSV file and have a look:
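
One possible starting point, shown only as a sketch: combine several keyword variants with the alternation operator `|`, make single letters optional with `?` to tolerate misspellings, and ignore case with `re.IGNORECASE`. The keyword list, the name `mentions_caffeine`, and the sample notes below are hypothetical and will need to be refined after inspecting the real notes:

```python
import re

# Hypothetical keyword list; "caff?ein" tolerates the misspelling "cafeine"
CAFFEINE_PATTERN = re.compile(r"coffee|caff?ein|energy drink", re.IGNORECASE)

def mentions_caffeine(note):
    """Return 1 if the note contains any of the keywords, otherwise 0."""
    return int(CAFFEINE_PATTERN.search(note) is not None)

# Made-up notes, for illustration only
print(mentions_caffeine("Drinks 3-4 cups of COFFEE per day"))  # 1
print(mentions_caffeine("cafeine use: yes"))                   # 1
print(mentions_caffeine("denies chest pain"))                  # 0
```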

In [12]:
df = pd.read_csv("data/01-regex/train.csv")
df.head()

Unnamed: 0,pn_history,label
0,Dillon Cleveland is a 17 yo M who presents to ...,0
1,HPI: 26 year old female c/o palpitations for 3...,0
2,HPI: Dillon Cleveland is a 17 yo M with a 3-4 ...,1
3,26 yo F in the clinic for follow up after epis...,0
4,HPI: Mr. Cleveland is a 17 yo m that presents ...,0


The DataFrame consists of the clinical notes and a binary label. 0 means that the note does not mention whether the patient uses caffeine. 1 means that the doctor mentions it in some way at least once in the note.
Now write a function that takes one note as input and predicts the label 0 or 1:

In [13]:
def pred_caffeine_use(note):
    ### your code ###

    # This is just an example: predict 1 if the substring "cof" occurs in the note
    result = re.findall("cof", note)
    return int(len(result) > 0)


In [14]:
####  DO NOT CHANGE ####

import pandas as pd
import numpy as np

def test_results(fnc):
    df_text = pd.read_csv("data/01-regex/test.csv")
    notes, labels = df_text.values.T
    label_preds = [fnc(note) for note in notes]

    # accuracy
    accuracy = sum([1 if label_preds[i] == labels[i] else 0 for i in range(len(labels))])/len(labels)

    # F1 score
    tp = sum([1 if label_preds[i] == labels[i] == 1 else 0 for i in range(len(labels))])
    fp = sum([1 if label_preds[i] == 1 and labels[i] == 0 else 0 for i in range(len(labels))])
    fn = sum([1 if label_preds[i] == 0 and labels[i] == 1 else 0 for i in range(len(labels))])
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)

    f1 = 2*precision*recall/(precision+recall)

    return {'Accuracy': np.round(accuracy, 3), 'F1-Score': np.round(f1, 3)}

In [15]:
# Let's test your function on the test set!
test_results(pred_caffeine_use)

{'Accuracy': 0.889, 'F1-Score': 0.545}
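
Accuracy can be much higher than the F1 score, as in the result above. This can happen when most notes have label 0, because correct 0 predictions raise accuracy but do not count towards precision or recall. The sketch below recomputes the same metrics on made-up toy labels. The function name `precision_recall_f1`, the toy lists, and the guards against division by zero are assumptions and not part of the test function above:

```python
def precision_recall_f1(preds, labels):
    """Compute precision, recall and F1 for binary predictions (positive class = 1)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)  # true positives
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)  # false positives
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: 8 negative notes and 2 positive notes
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
preds = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
precision, recall, f1 = precision_recall_f1(preds, labels)

print(accuracy)               # 0.8
print(precision, recall, f1)  # 0.5 0.5 0.5
```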