# Exact Text Matcher
The contest provided a large amount of data to classify and comparatively little training data. Our intended approach to classifying the raw data relies on some form of neural network, and neural networks generally perform better when given more training data. While we don't have a medical expert handy to label more data intelligently, we do have the next best thing: Python code that can, very, very unintelligently, produce a large amount of additional training data that we can be sure is accurate.

The following code identifies every pattern that is known to indicate a diagnosable feature, finds all instances of those patterns in the raw data, and labels them to expand the available training data.

In [390]:
import pandas as pd # for data handling
import re # for string manipulation

Importing the necessary dataframes:

In [391]:
path = 'nbme-score-clinical-patient-notes' # no trailing slash; the file names below supply the separator
features = pd.read_csv(path + '/features.csv')
pn = pd.read_csv(path + '/patient_notes.csv')
train = pd.read_csv(path + '/train.csv')

Building a dictionary of text identifiers for every feature:

In [392]:
import ast # for safely parsing the stringified annotation lists

feature_identifiers = {} # Keys are feature numbers, values are sets full of every text annotation labeled as that feature number
# loops through every feature to collect information
for feature_id in features['feature_num'].unique():
    identifying_text = set() # set to hold unique identifiers for current diagnosable feature
    feature_annotations = train[train['feature_num'] == feature_id]['annotation'].unique().tolist() # gets all unique lists of identifiers
    # loops through all lists to retrieve the individual unique text identifiers
    for note_annotation in feature_annotations:
        # literal_eval parses the list literal without executing arbitrary code, unlike eval
        matchable_texts = ast.literal_eval(note_annotation)
        for text in matchable_texts:
            identifying_text.add(text) # adds the string to the set
    feature_identifiers[feature_id] = identifying_text
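
The annotation column stores each list of phrases as a string, so every entry has to be parsed before its phrases can be collected. A minimal sketch of that step, assuming entries shaped like the made-up `raw_annotations` below (the strings are illustrative, not taken from train.csv):

```python
import ast

# Made-up annotation strings in the same shape as train['annotation']
raw_annotations = ["['chest pain']", "['chest pain', 'heart pounding']", "[]"]

identifying_text = set()
for raw in raw_annotations:
    # literal_eval accepts only Python literals, so it cannot run arbitrary code
    for text in ast.literal_eval(raw):
        identifying_text.add(text)

print(sorted(identifying_text))
```

Because the phrases go into a set, a phrase that was annotated many times is only stored once.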

Removing patient notes that have already been labeled to avoid redundant processing:

In [393]:
already_annotated = train['pn_num'].unique().tolist()
trimmed_notes = pn[~pn['pn_num'].isin(already_annotated)]
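
To see what the `~` and `isin` combination does, here is a small sketch on toy dataframes. The values are made up and only the `pn_num` column name matches the real data:

```python
import pandas as pd

# Toy stand-ins for the real dataframes; the values are made up
pn = pd.DataFrame({'pn_num': [0, 1, 2, 3], 'pn_history': ['a', 'b', 'c', 'd']})
train = pd.DataFrame({'pn_num': [1, 1, 3]})

already_annotated = train['pn_num'].unique().tolist()
# isin marks the notes that are already labeled; ~ flips the mask to keep the rest
trimmed_notes = pn[~pn['pn_num'].isin(already_annotated)]

print(trimmed_notes['pn_num'].tolist())
```

Only notes 0 and 2 survive here, since notes 1 and 3 already appear in the toy `train`.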

Initializing a dataframe for building the extension of training data:

In [394]:
train_extension = pd.DataFrame(columns=['id', 'case_num', 'pn_num', 'feature_num', 'annotation', 'unformatted_location'])

Searching every unlabeled note for every known identifier and recording the matches:

In [None]:
# Loop for patient note entry
for idx, row in trimmed_notes.iterrows():
    text = row['pn_history'] # The full text of the current patient note
    # Loop for each feature to compare it to the note
    for feature_num in feature_identifiers.keys():
        # Orders texts by descending length so that longer phrases claim their spans first
        feature_texts = sorted(feature_identifiers[feature_num], key=len, reverse=True)
        feature_spans = [] # holds spans to be used in expanded data set
        feature_annotations = [] # holds annotations matched in the text

        # Loop for identifying text
        for matchable_text in feature_texts:
            # The lookarounds avoid short strings like the "f" annotation being found in almost every patient note;
            # re.escape keeps characters such as "(" or "+" in an annotation from being read as regex syntax
            pattern = re.compile(fr"(?<!\w){re.escape(matchable_text)}(?!\w)")
            new_spans = [(m.start(), m.end()) for m in pattern.finditer(text)]

            # Checks each found span against the already-known spans
            for new_span in new_spans:
                no_conflict = True # reset for every candidate so one conflict doesn't reject the rest
                for old_span in feature_spans:
                    # Two spans overlap when each one starts before the other ends
                    if new_span[0] < old_span[1] and old_span[0] < new_span[1]:
                        no_conflict = False
                        break
                # If the new span doesn't overlap already known spans, it's added to the list of known spans
                if no_conflict:
                    feature_spans.append(new_span)
                    feature_annotations.append(matchable_text)

        # Creates a row entry in the train_extension dataframe representing the found data
        if len(feature_spans) > 0:
            train_extension.loc[train_extension.shape[0]] = {'case_num': row['case_num'], 'pn_num': row['pn_num'], 'feature_num': feature_num,
                                                             'annotation': feature_annotations, 'unformatted_location': feature_spans}
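
The matching logic is easier to check in isolation. The sketch below is one possible way to express it, not the notebook's own cell: `find_spans` is a hypothetical helper, and the note text and phrases are made up. It assumes that longer phrases should win and that any overlap with an accepted span disqualifies a new match.

```python
import re

def find_spans(text, phrases):
    """Return non-overlapping (start, end) spans and the phrases that produced them."""
    spans = []
    matched = []
    # Longest phrases first, so "chest pain" is claimed before "pain"
    for phrase in sorted(phrases, key=len, reverse=True):
        # re.escape keeps "(" and ")" literal; the lookarounds reject matches inside a word
        pattern = re.compile(rf"(?<!\w){re.escape(phrase)}(?!\w)")
        for m in pattern.finditer(text):
            new_span = (m.start(), m.end())
            # Half-open intervals overlap when each one starts before the other ends
            overlaps = any(new_span[0] < old[1] and old[0] < new_span[1] for old in spans)
            if not overlaps:
                spans.append(new_span)
                matched.append(phrase)
    return spans, matched

# Made-up note text and phrases, for illustration only
note = "17 yo f with chest pain. Pain (sharp) started after coffee."
spans, matched = find_spans(note, {"chest pain", "pain", "f", "Pain (sharp)"})

for (start, end), phrase in zip(spans, matched):
    print(start, end, repr(note[start:end]), phrase)
```

In this example the standalone "f" is found once, while the letters inside "after" and "coffee" are skipped, and "pain" is dropped because it falls inside the span already claimed by "chest pain".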

The steps taken to find and compare all of these new spans produce lists of integer tuples representing spans, e.g. [(1, 5), (10, 13)].

The data in the train dataframe is, somewhat inexplicably, formatted as string representations of lists of strings, with each span's start and end separated by a space rather than a comma, e.g. "['1 5', '10 13']".

Perhaps there's some reason for this that will become clear later when we're dealing with the transformer neural network. Perhaps not. In case it matters, the following code formats the new spans accordingly.

In [397]:
def format_location(entry):
    span_list = entry['unformatted_location'] # list of (start, end) tuples found by the matcher
    # Builds e.g. "['1 5', '10 13']" to match the location format used in the train dataframe
    return str(['{} {}'.format(span[0], span[1]) for span in span_list])

In [398]:
# Writes the formatted strings to a new location column and keeps the raw tuples alongside
train_extension['location'] = train_extension.apply(format_location, axis=1)
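
If the tuples are ever needed again, the formatted strings can be parsed back. A rough sketch of the round trip, assuming every entry uses only the plain 'start end' form described above; `spans_to_location` and `location_to_spans` are hypothetical helper names, not part of the notebook:

```python
import ast

def spans_to_location(spans):
    """Format [(1, 5), (10, 13)] as the string "['1 5', '10 13']"."""
    return str([f"{start} {end}" for start, end in spans])

def location_to_spans(location):
    """Parse a location string back into a list of (start, end) integer tuples."""
    return [tuple(int(n) for n in item.split()) for item in ast.literal_eval(location)]

formatted = spans_to_location([(1, 5), (10, 13)])
restored = location_to_spans(formatted)

print(formatted)
print(restored)
```

Checking that `restored` equals the original list is a cheap way to confirm that the formatting step loses no information.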

In [410]:
train_extension.head()

|   | id | case_num | pn_num | feature_num | annotation | unformatted_location | location |
|---|---|---|---|---|---|---|---|
| 0 |  | 0 | 0 | 0 | [father had MI] | [(553, 566)] | ['553 566'] |
| 1 |  | 0 | 0 | 2 | [chest pain] | [(432, 442)] | ['432 442'] |
| 2 |  | 0 | 0 | 3 | [intermittent] | [(216, 228)] | ['216 228'] |
| 3 |  | 0 | 0 | 6 | [aderol] | [(520, 526)] | ['520 526'] |
| 4 |  | 0 | 0 | 9 | [heart pounding] | [(71, 85)] | ['71 85'] |


In [411]:
# index=False keeps the row index from being written as an extra unnamed column
train_extension.to_csv(path + '/train_extension.csv', index=False)