<h1><center><b>NBME (National Board of Medical Examiners)</b></center></h1>

NBME is a mission-driven organization that specializes in the creation of high-quality assessments and learning tools.

Mission: Protecting the Health of the Public through State of the Art Assessment

Vision: Improving Health Care around the World through Assessment

In addition to offering assessment tools for every stage of the medical school journey, NBME aims to build meaningful collaborations and make lasting contributions to the medical education community. 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

root = "../input/nbme-score-clinical-patient-notes/"

# **Abstract**

Medical diagnoses require extracting symptoms and characteristics of the patient from the history notes.

Learning and evaluating the ability to write patient notes requires feedback from other doctors, a time-consuming process that could be improved with the addition of machine learning.

Until recently, the United States Medical Licensing Examination (USMLE) included a clinical skills exam.

The exam required examinees to interact with standardized patients (people trained to act out specific clinical cases) and write a note for the patient.

Trained medical raters then scored the patient notes against rubrics that described the important concepts of each case (known as features).

**Goal**

In this competition, the goal is to identify specific clinical concepts in patient notes.

Specifically, the solution will be an automated method for mapping clinical concepts from an exam rubric (e.g., "decreased appetite") to the various ways these concepts are expressed in clinical patient notes written by medical students (e.g., “eat less”, “clothes are looser”).

# **Libraries**

In [None]:
# Libraries needed for data analysis and solution development
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt, json, re, nltk

# **Dataset descriptions**

The training dataset stores the **feature** records extracted from clinical notes.

The pn_num column identifies a clinical note.

The feature_num column identifies the extracted **concept/feature.**

The annotation column describes the text(s) associated with **feature_num**.

The location column holds the character spans of the annotations. For example, the span `0 4` covers the first 4 characters of the clinical note (indices 0 to 3), because the end index is exclusive.
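
The span format can be illustrated with a minimal sketch. It assumes that each span is a `start end` pair, that several spans are separated by semicolons, and that the end index is exclusive, as in Python slicing. The helper `parse_location` and the sample note are hypothetical and not part of the dataset.

```python
def parse_location(location):
    """Turn a location string such as "0 4;15 25" into a list of (start, end) tuples."""
    spans = []
    for part in location.split(";"):
        start, end = part.split()
        spans.append((int(start), int(end)))
    return spans


# Hypothetical note text, used only for illustration.
note = "17yo male with chest pain"
spans = parse_location("0 4;15 25")

# Slicing the note with each span recovers the annotated fragments.
fragments = [note[start:end] for start, end in spans]
print(fragments)
```

Slicing with an exclusive end index means the length of a fragment is simply `end - start`.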

In [None]:
import os
print("List dir")
print(os.listdir())
print("First directory")
firstdir = os.getcwd()
print(os.getcwd())

In [None]:
df_train = pd.read_csv(root+"train.csv")
df_train.head()

The features dataset contains all the **features** that can be extracted from a note; each one has its own description and identifier (**feature_num**).

In total, there are 143 features.

In [None]:
df_features = pd.read_csv(root+"features.csv")
df_features.head()

In [None]:
df_features.shape

The patient notes dataset stores the clinical notes, each identified by **pn_num**.

In [None]:
df_patient_notes = pd.read_csv(root+"patient_notes.csv")
df_patient_notes.head()

# **Annotations example**

The process of extracting a feature is divided into these steps:

1) Select a patient note

2) Identify the feature to extract by feature_num

3) Search and find the annotation(s) corresponding to the feature

In [None]:
df_train["pn_num"].unique()[0:100]

In [None]:
pn_num = 39449

mask = df_patient_notes["pn_num"] == pn_num
df_masked = df_patient_notes[mask].copy()
print("Patient note "+str(pn_num))
print("="*64)
print(df_masked.iloc[0]["pn_history"])
print("")
print("")

df_train_masked = df_train[df_train["pn_num"] == pn_num].copy()
df_train_masked = df_train_masked.merge(df_features, on=["feature_num","case_num"])
df_train_masked.head(100)

In [None]:
# List every patient note that has a training row for feature 308
df_train_masked = df_train[df_train["feature_num"] == 308].copy()
df_train_masked["pn_num"].unique()

# **Tags extraction**

The proposed solution is to register keywords or tags to identify features in a text, and thus extract annotations.

The development is divided into these steps:

1) Loop through the feature dataset

2) For each feature, search for all associated annotations in the training dataset.

3) Join all the tags for each feature in a list and save them in a dictionary

In [None]:
dictionary_features = {}

In [None]:
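def extraer_tags_de_anotaciones(df_train_masked):
  """Collect the unique, lower-cased annotation tags found in a slice of the training data."""
  lista = []
  for index, row in df_train_masked.iterrows():
    if row["annotation"] != "[]":
      tags = row["annotation"].lower()
      # The column stores Python-style lists; swap the quotes so json can parse them.
      # Annotations that contain apostrophes or double quotes fail here and are reported.
      tags = tags.replace("'", '"')
      try:
        tags = json.loads(tags)
        for tag in tags:
          if tag not in lista:
            lista.append(tag)
      except ValueError:
        print("Could not convert "+str(tags)+" to a list")
  return lista


def obtnener_coordenadas_tag(tag,text,contraccion_izq=0,contraccion_der=0):
  """Return every occurrence of tag in text as "start end" spans joined by ";", or None."""
  lentag = len(tag)
  if lentag == 0:
      # An empty tag would never advance the search index
      return None
  index = 0
  locations = []
  while index < len(text):
      index = text.find(tag, index)
      if index == -1:
          break
      # The optional offsets shrink or widen the reported span on either side
      locations.append(str(index+contraccion_izq)+" "+str(index+lentag+contraccion_der))
      index += lentag
  if len(locations) == 0:
      locations = None
  else:
      locations = ';'.join(locations)
  return locations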

In [None]:
for index, row in df_features.iterrows():
    #print("Feature num: "+str(row['feature_num'])+" Feature text: "+str(row["feature_text"]))
    #print("="*64)

    df_train_masked = df_train[df_train["feature_num"] == row['feature_num']].copy()
    df_train_masked = df_train_masked[["annotation"]].copy()
    tags = extraer_tags_de_anotaciones(df_train_masked)

    #print("Count tags: "+str(len(tags)))
    dictionary_features[row['feature_num']] = tags
    #print("")

# **Problems in text processing**

Some annotations could not be parsed: when an annotation contains an apostrophe or a double quote, replacing the single quotes with double quotes produces invalid JSON.

These tags are added manually, together with some representative tags for each set. The intention is that, at the very least, the solution can detect the location(s) of the concepts within the notes.
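
The failure mode can be reproduced with a small sketch. It assumes the annotation column stores Python-style list literals; the sample value is hypothetical. `ast.literal_eval` from the standard library is shown as one possible alternative to the quote replacement, not as the method used in this notebook.

```python
import ast
import json

# Hypothetical annotation value that contains an apostrophe.
raw = '["couldn\'t catch his breath"]'

# Replacing every single quote with a double quote also rewrites the apostrophe,
# which leaves a string that is no longer valid JSON.
try:
    json.loads(raw.replace("'", '"'))
    json_ok = True
except ValueError:
    json_ok = False

# ast.literal_eval parses Python list literals directly, so no replacement is needed.
tags = ast.literal_eval(raw)
print(json_ok, tags)
```

Parsing with `ast.literal_eval` would remove the need for most manual corrections, at the cost of changing which tags end up in the dictionary.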

In [None]:
n=0
dictionary_features[n].append('dad with "heart attack"')

n=1
dictionary_features[n].append('mom with "thyroid disease"')
dictionary_features[n].append('mom has "thyroid disorder"')
dictionary_features[n].append('mother with "thyroid"')
dictionary_features[n].append('mom with "thyroid problem"')
dictionary_features[n].append('mother-"thyroid problem"')
dictionary_features[n].append('mother thyroid "problem"')
dictionary_features[n].append('mother has "thyroid issues"')
dictionary_features[n].append('mom has "thyroid problems"')
dictionary_features[n].append('mother has "thyroid issue"')
dictionary_features[n].append('mother - "thyroid problem"')
dictionary_features[n].append('mother with "thyroid problem"')
dictionary_features[n].append('mom- "thyroid problem"')
dictionary_features[n].append('mother with "thyroid" condition"')

n=4
dictionary_features[n].append('felt ike he was going to "pass out"')

n=7
dictionary_features[n].append("couldn't catch his breath")

n=9
dictionary_features[n].append('heart pounding')
dictionary_features[n].append('heart "jumping out of his chest"')
dictionary_features[n].append('heart is "pounding"')
dictionary_features[n].append('heart racing')

n=101
dictionary_features[n].append('clothes "fit looser"')

n=102
dictionary_features[n].append("hasn't been sexually active in 9 months")
dictionary_features[n].append('not sexually active in 9 months')


n=109
dictionary_features[n].append('low appetite')
dictionary_features[n].append("didn't eat anything since last dinner")
dictionary_features[n].append("hasn't felt hungry")


n=205
dictionary_features[n].append('irregular periods')
dictionary_features[n].append('happen every 3 weeks to 4 months')
dictionary_features[n].append('periods are "unpredictable"')
dictionary_features[n].append('have been "unpredictable,"')
dictionary_features[n].append('"unpredictable" periods')
dictionary_features[n].append('periods have been "unpredictable"')
dictionary_features[n].append('now "unpredictable"')

n=206
dictionary_features[n].append('last week, nausea')
dictionary_features[n].append("last week flu-like sx's")

n=210
dictionary_features[n].append("lmp was 2 mo's ago")

n=212
dictionary_features[n].append('range from 3 weeks to 4 months and are "consistently inconsistent" in timing')

n=300
dictionary_features[n].append('uncle with "bleeding ulcer"')
dictionary_features[n].append('paternal uncle with "bleeding ulcer"')

n=301
dictionary_features[n].append('sensation in epigastric ("mid-chest") area')

n=304
dictionary_features[n].append('"burning" and "gnawing"')

n=313
dictionary_features[n].append("tums doesn't work any more")
dictionary_features[n].append("at first tums helped him  but no it isn't")
dictionary_features[n].append('tums helped at first')
dictionary_features[n].append('tums now don"t help')
dictionary_features[n].append("tum didn't improve the problem")

n=401
dictionary_features[n].append('nervousness')
dictionary_features[n].append('feels as if she is "overwhelmed"')
dictionary_features[n].append('feels as if "losing her mind"')
dictionary_features[n].append('feels "overwhelmed"')
dictionary_features[n].append('anxiousness')
dictionary_features[n].append("i feel like i'm losing my mind a bit")

n=405
dictionary_features[n].append("hasn't lost any weight")

n=408
dictionary_features[n].append("doesn't have an appetite")

n=503
dictionary_features[n].append('associated with dyspnea')
dictionary_features[n].append("associated with she] can't catch breath")

n=504
dictionary_features[n].append('feeling that "my heart is going to beat out of my chest"')
dictionary_features[n].append('palpitations')
dictionary_features[n].append('presenting for a follow up from an er visit 2 weeks ago from "heart racing"')


n=509
dictionary_features[n].append('five years ago occuring once every 2-3 months occuring "every few days for the last two weeks"')
dictionary_features[n].append('worse in frequency for the last two weeks')

n=510
dictionary_features[n].append('felt her heart "pounding" and "racing", she feels a sense of doom')
dictionary_features[n].append('sensation that she "was going to die"')
dictionary_features[n].append("feels like i'm going to die")
dictionary_features[n].append('associated feeling "something bad is going to happen."')
dictionary_features[n].append("feels like she's going to die")

n=513
dictionary_features[n].append('clammy')
dictionary_features[n].append('going from "hot"')
dictionary_features[n].append('feeling "hot"')

n=600
dictionary_features[n].append('feeling "hot"')
dictionary_features[n].append('feeling "warm"')

n=603
dictionary_features[n].append('stuffy nose for past several days')
dictionary_features[n].append('stuffy nose')
dictionary_features[n].append('"stuffy nose"')
dictionary_features[n].append('"stuffy nose" for past several days')

n=610
dictionary_features[n].append("albuterol didn't help")
dictionary_features[n].append("albuterol didn't alleviate the pain")
dictionary_features[n].append("inhaler didn't help.")

n=702
dictionary_features[n].append('menstrual irregularity')
dictionary_features[n].append('menorrhagia')
dictionary_features[n].append('tampon changes every "couple of hours"')
dictionary_features[n].append('excessive bleeding')
dictionary_features[n].append('irrgualr uterine bleeding')
dictionary_features[n].append('heavyier periods')
dictionary_features[n].append('irregulites in her period')
dictionary_features[n].append('change her tampons "every few hours"')

n=704
dictionary_features[n].append("don't use condoms")

n=706
dictionary_features[n].append("couldn't conceive")
dictionary_features[n].append("i can't get pregnant")
dictionary_features[n].append("couldn't get pregnant")

n=801
dictionary_features[n].append("son's death 3 weeks ago")
dictionary_features[n].append('son was killed 3 weeks ago')
dictionary_features[n].append("son's death")
dictionary_features[n].append("3 weeks son's death")
dictionary_features[n].append("3 weeks ago time of her son's death")

n=803
dictionary_features[n].append("auditory hallucination of neighbor's playing music last night")
dictionary_features[n].append("thought she heard voices from a neighbors' party despite that a party never occurred")
dictionary_features[n].append("heard loud 'party like noises' from neighbor\'s house 2 nights ago, but neighbor did not have a party")
dictionary_features[n].append("heard a party at her neighbor's that wasn't there")
dictionary_features[n].append("episodes of hearing a neighbor's party")
dictionary_features[n].append("heard loud music/party where there wasn't one")
dictionary_features[n].append("heard noises from neighbor's yard that were not present")
dictionary_features[n].append("heard a party going on next door which didn't really happen")

n=809
dictionary_features[n].append("can't take a nap because can't fall to slee")

n=810
dictionary_features[n].append("don't improves even when she took a ambien")
dictionary_features[n].append("tried sleeping pills but didn't help")
dictionary_features[n].append("tried ambien to sleep but it didn't help")
dictionary_features[n].append("ambien hasn't helped")

n=811
dictionary_features[n].append('feels "tired"')
dictionary_features[n].append('low energy')
dictionary_features[n].append('feeling "drained"')
dictionary_features[n].append('feels "drained"')

n=813
dictionary_features[n].append('saw her deceased son in the kitchen')
dictionary_features[n].append('"saw her son" in the living room, although she knew he was not there')
dictionary_features[n].append('saw her son at the kitchen table 4 days ago, but knows he wasn"t there')
dictionary_features[n].append('episodes of seeing "visions" of her son')
dictionary_features[n].append("seeing her son at times but she knows he isn't there")
dictionary_features[n].append('"saw leonard" 4 days ago')
dictionary_features[n].append("she saw and heard her son in the kitchen even though she knew it wasn't really him")

n=817
dictionary_features[n].append('poor sleeps')
dictionary_features[n].append("hasn't been able to get good sleep")
dictionary_features[n].append('4-5 hours of sleep every night')

n=900
dictionary_features[n].append("ibuprofen didn't help")
dictionary_features[n].append("tylenol didn't help")

n=904
dictionary_features[n].append('"head hurting" all over')

n=916
dictionary_features[n].append('feeling "warm"')

# **Ambiguous tags**

Some tags are at most 2 characters long, so they can also match as substrings of unrelated words.
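
A minimal sketch of the problem, using a hypothetical sentence: a plain substring search finds the short tag inside a longer word, while a search restricted to whole tokens does not. The helper `find_whole_word` is an assumption introduced for illustration and is not used elsewhere in this notebook.

```python
import re


def find_whole_word(tag, text):
    """Return (start, end) spans where tag occurs as a whole token, not inside another word."""
    pattern = r"(?<!\w)" + re.escape(tag) + r"(?!\w)"
    return [match.span() for match in re.finditer(pattern, text)]


# Hypothetical note fragment.
text = "pt states chest pain started at rest"

# Plain substring search: the first hit for "st" lies inside the word "states".
substring_hit = text.find("st")

# Whole-token search: "st" never appears on its own, "pain" does.
short_tag_spans = find_whole_word("st", text)
long_tag_spans = find_whole_word("pain", text)
print(substring_hit, short_tag_spans, long_tag_spans)
```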

In [None]:
ambiguous_tags = []
for index, row in df_features.iterrows():
    lesslength = 10000000
    for tag in dictionary_features[row['feature_num']]:
      if len(tag) < lesslength:
        lesslength = len(tag)
    if lesslength <= 2:
      print("Feature num: "+str(row['feature_num'])+" Feature text: "+str(row["feature_text"]))
      print("="*64)
      print(dictionary_features[row['feature_num']])
      print(" ")
      ambiguous_tags.append({"feature_num":row['feature_num'],"feature_text":row["feature_text"]})

df_ambiguous_tags = pd.DataFrame(ambiguous_tags)

# **Definition of feature categories**

Duplicate features exist within the feature dataset.

Since some features represent the same concepts as others, it is convenient to create feature categories to unite the tags and thus be able to carry out a more comprehensive search.

For example, the values "Female" and "Male" are repeated in the "feature_text" column.

In [None]:
df_ambiguous_tags.head(10)

Categories:

* 0: It is a feature that does not need any special treatment
* 1: females, needs a global tag list and special treatment
* 2: males, needs a global tag list and special treatment
* 3: "num" - year, needs a function that generates a list of tags for each "num"
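
As a sketch only, the categories above could also be derived from the feature text instead of being listed by hand. This assumes the feature text is exactly "Female", "Male", or a number followed by "-year"; the function `categorize_feature` and the toy rows are hypothetical, and the hand-written list in this notebook remains the reference.

```python
import re

import pandas as pd


def categorize_feature(feature_text):
    """Map a feature description to one of the four categories listed above."""
    text = feature_text.strip().lower()
    if text == "female":
        return 1
    if text == "male":
        return 2
    if re.fullmatch(r"\d+-year", text):
        return 3
    return 0


# Toy rows standing in for the features dataset (assumed values).
toy = pd.DataFrame({"feature_text": ["Female", "Male", "17-year", "Chest-pressure"]})
toy["feature_cat"] = toy["feature_text"].apply(categorize_feature)
print(toy)
```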

In [None]:
df_features.head(200)

In [None]:
feature_categories = [0,0,0,0,0,0,0,0,0,0,0,3,2,0,0,0,0,3,0,0,0,0,0,0,0,1,0,0,0,0,
 0,0,0,0,1,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,3,0,0,0,
 0,0,0,0,1,0,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,
 2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,3,0,0,0,1,0,0,
 3,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]

df_features["feature_cat"] = pd.Series(feature_categories)

df_features.head(200)

# **Addition of representative tags**

The tags should cover common concepts: if the exact text segments are not detected, the software should at least be able to identify the main tags.

Therefore, artificial tags will be added and recorded in a list.
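
The repeated "extend the tag list, then log the additions" pattern could be wrapped in a small helper. This is a sketch under assumptions: the name `add_artificial_tags` is hypothetical, and unlike the plain `+=` used in this notebook it skips tags that are already registered.

```python
def add_artificial_tags(tag_dictionary, history, feature_num, new_tags):
    """Append tags not yet registered for feature_num and log each addition in history."""
    registered = tag_dictionary.setdefault(feature_num, [])
    for tag in new_tags:
        if tag not in registered:
            registered.append(tag)
            history.append(tag)


# Toy data standing in for dictionary_features and history_artificial_tags.
toy_dictionary = {0: ["heart attack"]}
toy_history = []
add_artificial_tags(toy_dictionary, toy_history, 0, ["heart attack", "cardiac arrest"])
print(toy_dictionary, toy_history)
```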

In [None]:
history_artificial_tags = []



feature_num = 0
additional_tags = ["heart attack","cardiac issues","cardiac issue","heart problem","heart disease","heart-attack"
,"chest-pains","chest pains","chest-pain","chest pain","coronary-infarction","coronary infarction","infarction","coronary","heart-failure",
"heart failure","congestive-heart-failure","myocardial-infarction","myocardial infarction","cardiac-arrest","cardiac arrest"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 1
additional_tags = ["thyroid problems","thyroid-problems","thyroid-problem","thyroid issues","thyroid issue",
"thyroid-issues","thyroid-issue","hypothyroidism","thyoid disorder","thyoid-disorder","thyroid disease","thyroidal",
"pituitary","thyroid-gland","thyroid gland","thyroid","parathyroid"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 2
additional_tags = ["chest-pressure"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 4
additional_tags = ["empty-headed","empty headed","delirious","pass out","passed out","dizzy"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 5
additional_tags = ["hair changes","hair chages","nail changes","nail chages","heat intolerance","temperature intolerance"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 10
additional_tags = ["few months"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 100
additional_tags = ["no vaginal discharge","no discharge"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 102
additional_tags = ["no sexual activity","no seuxual actvity"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 103
additional_tags = ["diarrhoea","constipation","collywobbles","lientery","indigestion"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 204
additional_tags = ["vaginal parchedness"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 205
additional_tags = ["irregular-menses","Irregular menses"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 206
additional_tags = ["recent nausea","recent vomiting","recent flulike"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 207
additional_tags = ["no premenstrual","denies premenstrual","no premenstrual symptoms","denies premenstrual symptoms"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 209
additional_tags = ["stress","stressed"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 212
additional_tags = ["irregular-flow","irregular-frequency","irregular frequency","irregular-intervals","Irregular intervals"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 214
additional_tags = ["heavy-sweating","heavy sweating"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 215
additional_tags = ["sleep-disturbance","sleep disturbance","early-awakenings","early awakenings","awakes at","insomnia",
                   "inability-to-sleep","insomnolence","hypersomnia","sleeplessness","restlessness","early waking"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 301
additional_tags = ["epigastric-discomfort","epigastric discomfort"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 313
additional_tags = ["tums does not work","tums do not work","tums don't work"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 403
additional_tags = ["excessive caffeine","lot of coffie","heavy caffeine"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 505
additional_tags = ["2 weeks ago workup normal"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 512
additional_tags = ["throat-tightness","throat tightness","throat tightening","throat-tightening"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 600
additional_tags = ["feeling 'hot'","feeling 'warm'"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 606
dictionary_features[feature_num].remove("cp")


feature_num = 610
additional_tags = ["albuterol no","albuterol did not","albuterol didn't"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 801
additional_tags = ["son passed away","son died"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 803
additional_tags = ["hallucination","auditory hallucination","hearing noises"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 804
additional_tags = ["tossing","turning"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 807
additional_tags = ["hallucinations after","hallucination after"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 809
additional_tags = ["unsuccessful napping"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 810
additional_tags = ["ambient didn't","ambient hasn't","ambien didn't","ambien hasn't"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 815
additional_tags = ["wakes up at"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 904
additional_tags = ["headache"]
dictionary_features[feature_num] += additional_tags
history_artificial_tags += additional_tags


feature_num = 906
dictionary_features[feature_num].remove("v")


feature_num = 908
dictionary_features[feature_num].remove("n")


feature_num = 909
dictionary_features[feature_num].remove("st")

# **Tags for female features**

These are all the "female" features.

Next, I join all the tags of this section of the dataset into one general list.

In [None]:
mask_1 = df_features["feature_cat"] == 1
df_features_1 = df_features[mask_1].copy()
df_features_1.head(100)

In [None]:
general_female_tags = []
for index,row in df_features_1.iterrows():
  for tag in dictionary_features[row["feature_num"]]:
    if tag not in general_female_tags and len(tag) > 1:
      general_female_tags.append(tag)
general_female_tags.append(" f ")
general_female_tags = list(dict.fromkeys(general_female_tags))
general_female_tags

# **Tags for male features**

These are all the "male" features.

Next, I join all the tags of this section of the dataset into one general list.

In [None]:
mask_2 = df_features["feature_cat"] == 2
df_features_2 = df_features[mask_2].copy()
df_features_2.head(100)

In [None]:
general_male_tags = []
for index,row in df_features_2.iterrows():
  for tag in dictionary_features[row["feature_num"]]:
    if tag not in general_male_tags and len(tag) > 1:
      general_male_tags.append(tag)
general_male_tags.append(" m ")
general_male_tags = list(dict.fromkeys(general_male_tags))
general_male_tags

# **Tags for num-year features**

These are all the "num"-year features.

Next, I join all the tags of this section of the dataset into one general list.
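
As an alternative to enumerating every spelling variant, an age mention could be matched with a regular expression. The sketch below is an assumption and not the approach taken in this notebook: `AGE_PATTERN` and `find_ages` are hypothetical names, and the pattern covers only a handful of the variants seen in the notes.

```python
import re

# A number, an optional separator, a "year" marker, and an optional "old" suffix.
# The final lookahead rejects matches that continue into a longer word.
AGE_PATTERN = re.compile(
    r"(?<!\d)(\d{1,3})[\s-]*(?:years?|yrs?|y\.o\.?|y/o|y-o|y o|yo|y)(?:[\s-]*old)?(?![a-z])",
    re.IGNORECASE,
)


def find_ages(text):
    """Return (age, matched_text) pairs for every age mention found in text."""
    return [(match.group(1), match.group(0)) for match in AGE_PATTERN.finditer(text)]


# Hypothetical note fragments.
print(find_ages("17yo male"))
print(find_ages("a 35-year-old woman"))
print(find_ages("44 y/o F"))
print(find_ages("took 2 yellow pills"))
```

A pattern like this trades the explicit tag list for a rule that also accepts unseen ages, so it would need checking against the misspelled variants that the list handles explicitly.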

In [None]:
mask_3 = df_features["feature_cat"] == 3
df_features_3 = df_features[mask_3].copy()
df_features_3.head(100)

In [None]:
general_num_year_tags = []
for index,row in df_features_3.iterrows():
  for tag in dictionary_features[row["feature_num"]]:
    if tag not in general_num_year_tags and len(tag) > 1:
      general_num_year_tags.append(tag)
general_num_year_tags = list(dict.fromkeys(general_num_year_tags))
#general_num_year_tags

In [None]:
general_num_year_tags = ['xyo',
 'x y/o','x yo','x year old','x-year-old','x y.o.','x','x yr old','x y.o','xy','x years old','x y/o','x yo','x year',
 'x y','xy/o','xyo','x-year','x yr','x year old','x y o','x','x-year-old','x-year old','x y.o','x ye','x yo','x','x y.o','xyo',
 'x yo f',
 'x year old','x year','xf','x-year-old','x y/o','x years old','x y-o','x yr old','xyear','x year olf','x year old','xyo','x yo','x','x-year-old','x y.o',
 '5 year old',
 'x y',
 'x y.o.','x y/o','x yr old','x year old','x year','x-year','xyo','x yo','x yr','x y/o','x-year-old','x y o','x yo','x y.o',
 'x',
 'x yo',
 'xy',
 'x year old',
 'x','x y/o','xyo','x-year-old','x','x y','x year','x yeo','x f','x years old','xy/o','xyr','x year','x  yo','xyr old',
 'x y o',
 'x yo old','x year-old','x-yo','x yo','x-year old','x f','x y o','x  yo','x y.o','xyo','x yo','x y old','x year old','x y/o','x yr',
 'x',
 'x-year-old',
 'x y','x','x y.o.','x yr old','x y.o.','xyear old','x yo','x year-old','x yr old','x year olf','x y old','x yo','xyr old',
 'x year odl']

general_num_year_tags = list(dict.fromkeys(general_num_year_tags))
general_num_year_tags

In [None]:
def generate_num_year_tags(num):
  x = str(num)
  return [x+'yo',x+' y/o',x+' yo',x+' year old',x+'-year-old',x+' y.o.',x+' yr old',x+' y.o',x+'y',x+' years old',x+' year',x+' y',x+'y/o',x+'-year',
  x+' yr',x+' y o',x+'-year old',x+' ye',x+' yo f',x+'f',x+' y-o',x+'year',x+' year olf',x+' year old',x+' yeo',x+' f',x+'yr',x+'  yo',x+'yr old',
  x+' yo old',x+' year-old',x+'-yo',x+' y old',x+'year old',x+' year odl']

# **ALBERT model**

Category 0 features require a model to extract the relevant text when the tag lists are not enough for the tag-based search.
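
The model outputs one probability per token and feature, which then has to be turned into character spans. The sketch below shows one way to do that for a single feature, assuming a cutoff of 0.5 and assuming that special and padding tokens carry the offset `(0, 0)`, as fast tokenizers report them. The function `token_probs_to_spans` and the toy values are hypothetical.

```python
import numpy as np


def token_probs_to_spans(probs, offsets, threshold=0.5):
    """Merge consecutive tokens whose probability exceeds threshold into character spans."""
    spans = []
    current = None
    for prob, (start, end) in zip(probs, offsets):
        # Tokens with an empty offset (special or padding tokens) are never selected.
        selected = start != end and prob > threshold
        if selected:
            if current is None:
                current = [start, end]   # open a new span
            else:
                current[1] = end         # extend the open span
        elif current is not None:
            spans.append(tuple(current)) # close the open span
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


# Toy probabilities and character offsets for seven tokens.
probs = np.array([0.0, 0.1, 0.9, 0.8, 0.2, 0.7, 0.0])
offsets = [(0, 0), (0, 2), (3, 8), (9, 13), (14, 18), (19, 24), (0, 0)]
merged = token_probs_to_spans(probs, offsets)
print(merged)
```

Joining each span as `start end` and separating spans with `;` would give a string in the same format as the location column.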

In [None]:
import tensorflow as tf

from tqdm.notebook import tqdm #Useful to monitor progress in loop
tqdm.pandas()
from nltk.tokenize import word_tokenize, sent_tokenize
from transformers import PreTrainedTokenizerFast, TFAlbertModel, AlbertConfig

import re,os,random,math,pickle

In [None]:
# Input length for AlBERT model
input_length = 512
# Random seed
random_seed = 42
# List of models
MODELS = []

features = pd.read_csv(root+"features.csv")

# Additional feature_num
features['feature_num_ordinal'] = features['feature_num'].astype('category').cat.codes

#This will be the dimension of the output in the model
n_features = len(features)

In [None]:
# Predefined configuration
albert_xxlarge_config = AlbertConfig(
  hidden_size = 4096,
  intermediate_size = 16384,
  max_position_embeddings = 512,
  model_type = 'albert',
  num_attention_heads = 64,
)

albert_base_config = AlbertConfig(
  hidden_size = 768,
  intermediate_size = 3072,
  max_position_embeddings = 512,
  model_type = 'albert',
  num_attention_heads = 12,
)

In [None]:
def generate_model(file_name, config):
    # Input layer
    input_ids = tf.keras.layers.Input(shape=(input_length,), dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.layers.Input(shape=(input_length,), dtype=tf.int32, name='attention_mask')

    # ALBERT model
    albert = TFAlbertModel(config)
    last_hidden_state = albert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    do = tf.keras.layers.Dropout(0.00, name='dropout')(last_hidden_state)

    # Output layer gives probabilities of each token to belong to each feature
    output = tf.keras.layers.Dense(n_features, activation='sigmoid', name='head/classifier')(do)

    # Final model
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=[output])
    
    # Weights
    if file_name is None:
        model.load_weights('/kaggle/input/nbme-albert-large-training-tpu-dataset/model.h5')
    else:
        model.load_weights(f'/{file_name}.h5')
    
    # Append to Models List
    MODELS.append(model)
    
    return model

In [None]:
# Clear backend
tf.keras.backend.clear_session()
# Enable XLA, a compiler that can speed up TensorFlow models
tf.config.optimizer.set_jit(True)
# Note: despite the variable name, this model is built from albert_base_config
albert_xxlarge_v18_model = generate_model(None, albert_base_config)
# Models
for m in MODELS:
    m.summary()  # summary() prints the table itself and returns None
    print('\n' * 3)

In [None]:
patient_notes = pd.read_csv(root+'patient_notes.csv')
patient_notes = patient_notes.set_index(['case_num', 'pn_num'])
patient_notes['pn_history_lower'] = patient_notes['pn_history'].str.lower()
patient_notes.head()

In [None]:
# Loading tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained('../input/nbme-preprocessing-albert-public/tokenizer')

# Tokenize the text with the ALBERT tokenizer loaded above
def tokenize(note):
    return tokenizer(
        note,
        padding = 'max_length',
        truncation = True,
        max_length = input_length,
        return_offsets_mapping = True,
    )

def correct_probability_in_prediction(t_dec_prev, t_dec, t_dec_next, row_feature_num, prob):    
    if row_feature_num == 70:
        if t_dec in ['ms']:
            return 1
    elif row_feature_num == 107:
        if re.fullmatch(r'\d', t_dec) or t_dec in ['months', 'mot', 'nh', 's']:
            return 1
    elif row_feature_num == 103:
        if t_dec in ['unprotected', 'sex']:
            return 1
    elif row_feature_num == 92:
        if t_dec == 'as' and t_dec_next == 'tham':
            return 1
        elif t_dec_prev == 'as' and t_dec == 'tham':
            return 1
    elif row_feature_num == 93:
        if prob > 0.01 and t_dec in ['chest', 'pain']:
            return 1
    return prob


def find_all(a, b, offset=0):
    """Return the start index of every non-overlapping, case-insensitive occurrence of b in a."""
    start_idx = a.lower().find(b.lower())
    if start_idx != -1:
        return [offset + start_idx] + find_all(a[start_idx + len(b):], b, offset=offset + start_idx + len(b))
    else:
        return []
    
# Returns the character position in the patient note
def get_char_pos(row, return_str):
    annotation = row['t_dec']
    start_ann, end_ann = row['om']
    if start_ann == -1:
        if return_str:
             return chr(0)
        else:
            return (-1, -1)
    
    patient_note = patient_notes.loc[row['group_idx'], 'pn_history']
    
    starts_in_patient_note = find_all(patient_note, annotation)
    
    for start in starts_in_patient_note:
        end = start + len(annotation)
        if start >= start_ann - 1 and end <= end_ann + 1 and end <= len(patient_note):
            if return_str:
                 return patient_note[start:end]
            else:
                return start, end

In [None]:
threshold = 0.50
test = pd.read_csv(root+'test.csv')
test.head(100)

In [None]:
tqdm.pandas()
def process_data(dataset):
    dataset['feature_num_ordinal'] = features.set_index('feature_num').loc[dataset['feature_num'], 'feature_num_ordinal'].values
    dataset = dataset.set_index(['case_num','pn_num'])
    list_processed_data = []
    for group_idx, group in tqdm(dataset.groupby(['case_num','pn_num'])):
        # Patient Note
        pn_history_clean = patient_notes.loc[group_idx,'pn_history_lower']
        # Tokenize Patient Note
        tokens = tokenize(pn_history_clean)
    
        # Token Properties
        input_ids = tokens['input_ids']
        attention_mask = tokens['attention_mask']
        offset_mapping = tokens['offset_mapping']
    
        # Probabilities of each token belonging to each feature
        y_pred = np.zeros(shape=[n_features, input_length], dtype=np.float32)
    
        for m in MODELS:
            y_pred += m.predict_on_batch({
                'input_ids': np.array([input_ids]),
                'attention_mask': np.array([attention_mask]),
            }).squeeze().T / len(MODELS)
    
        # Iterate over all features
        for row_id, row_feature_num in group[['id','feature_num_ordinal']].itertuples(index=False, name=None):
            annotation_found = False
        
            # Prediction per Feature Number
            y_pred_row = y_pred[row_feature_num]
        
            om_pred = []
            # Iterate over all offset mappings, input tokens and prediction probabilities
            for idx, (om, t, prob) in enumerate(zip(offset_mapping, input_ids, y_pred_row)):
                # Decode Token
                t_dec = tokenizer.decode(t)
            
                t_dec_prev = tokenizer.decode(input_ids[idx - 1]) if idx > 0 else None
                t_dec_next = tokenizer.decode(input_ids[idx + 1]) if idx < len(input_ids) - 1 else None
                prob = correct_probability_in_prediction(t_dec_prev, t_dec, t_dec_next, row_feature_num, prob)
            
                # Keep tokens above the prediction threshold that are not utility tokens (START, END, PAD etc.)
                if prob > threshold and len(t_dec) > 0 and t > 4 :
                    annotation_found = True
                    list_processed_data.append({
                        'row_id': row_id,
                        'om': om,
                        't': t,
                        't_dec': t_dec,
                        'group_idx': group_idx,
                    })
                
            # Add Empty Annotation if no annotation is found
            if not annotation_found:
                list_processed_data.append({
                    'row_id': row_id,
                    'om': (-1,-1),
                    't': -1,
                    't_dec': chr(0),
                    'group_idx': group_idx,
                })
                
    df_data = pd.DataFrame.from_dict(list_processed_data)
    df_data['om_original'] = df_data.progress_apply(get_char_pos, axis=1, return_str=False)
    df_data['om_original_str'] = df_data.progress_apply(get_char_pos, axis=1, return_str=True)
    return df_data

data_preprocessed = process_data(test)
data_preprocessed.head(100)

In [None]:
def albert_predict(processed_data):
    submission_rows = []
    for group_idx, group in tqdm(processed_data.groupby('row_id')):
        # -np.inf marks "no span open yet" (np.NINF was removed in NumPy 2.0)
        start_prev = -np.inf
        end_prev = -np.inf
        location = ''
        for start, end in group['om_original']:
            # Previous token also belongs to location, increase end location
            # i.e. 3:5 followed by 6:10 will have start 3 and end will be increased from 5 -> 10
            if end_prev + 1 >= start and start <= end_prev + 2:
                end_prev = end
            else:
                # Previous token does not belong to location, add current location
                if end_prev != -np.inf:
                    # After first location span is added, following location spans are delimited with a ";"
                    if len(location) > 0:
                        location += f';{start_prev} {end_prev}'
                    # First location span has no ";"
                    else:
                        location += f'{start_prev} {end_prev}'
                # First location span, set start and end
                if start > -1 and end > -1:
                    start_prev = start
                    end_prev = end         
        # Add last location
        if end_prev > -1:
            if len(location) > 0:
                location += f';{start_prev} {end_prev}'
            else:
                location += f'{start_prev} {end_prev}'   
        submission_rows.append({ 'id': group_idx, 'location': location })
    return submission_rows

predictions_test = pd.DataFrame.from_dict(albert_predict(data_preprocessed))
predictions_test.head(100)
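
The merging rule inside `albert_predict` is easier to follow in isolation. The sketch below is a simplified illustration, not the notebook's implementation: `merge_offsets` and `max_gap` are hypothetical names, the offsets are sorted first, and `(-1, -1)` is treated as the "no annotation" marker written by `process_data`.

```python
def merge_offsets(offsets, max_gap=1):
    """Merge (start, end) character offsets that overlap or are at most
    max_gap characters apart, and format them as 'start end;start end'."""
    spans = []
    for start, end in sorted(offsets):
        if start < 0:  # (-1, -1) means no annotation was found
            continue
        if spans and start <= spans[-1][1] + max_gap:
            # Token continues the open span: extend its end
            spans[-1][1] = max(spans[-1][1], end)
        else:
            # Token starts a new span
            spans.append([start, end])
    return ";".join(f"{s} {e}" for s, e in spans)


print(merge_offsets([(3, 5), (6, 10), (20, 24)]))  # 3 10;20 24
```

With `max_gap=1`, two tokens separated by a single space end up in the same span, which matches the `3:5` followed by `6:10` case described in the comments of `albert_predict`.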

# **RoBERTa Model**

In [None]:
import os,re,ast,json,glob,numpy as np,pandas as pd
from tqdm.notebook import tqdm

os.environ["TOKENIZERS_PARALLELISM"] = "false"

DATA_PATH = "../input/nbme-score-clinical-patient-notes/"
OUT_PATH = "../input/nbme-roberta-large/"
WEIGHTS_FOLDER = "../input/nbme-roberta-large/"

NUM_WORKERS = 2

def process_feature_text(text):
    text = re.sub('I-year', '1-year', text)
    text = re.sub('-OR-', " or ", text)
    text = re.sub('-', ' ', text)
    return text


def clean_spaces(txt):
    txt = re.sub('\n', ' ', txt)
    txt = re.sub('\t', ' ', txt)
    txt = re.sub('\r', ' ', txt)
#     txt = re.sub(r'\s+', ' ', txt)
    return txt


def load_and_prepare_test(root=""):
    patient_notes = pd.read_csv(root + "patient_notes.csv")
    features = pd.read_csv(root + "features.csv")
    df = pd.read_csv(root + "test.csv")

    df = df.merge(features, how="left", on=["case_num", "feature_num"])
    df = df.merge(patient_notes, how="left", on=['case_num', 'pn_num'])

    df['pn_history'] = df['pn_history'].apply(lambda x: x.strip())
    df['feature_text'] = df['feature_text'].apply(process_feature_text)

    df['feature_text'] = df['feature_text'].apply(clean_spaces)
    df['clean_text'] = df['pn_history'].apply(clean_spaces)

    df['target'] = ""
    return df


import itertools


def token_pred_to_char_pred(token_pred, offsets):
    char_pred = np.zeros((np.max(offsets), token_pred.shape[1]))
    for i in range(len(token_pred)):
        s, e = int(offsets[i][0]), int(offsets[i][1])  # start, end
        char_pred[s:e] = token_pred[i]

        if token_pred.shape[1] == 3:  # following characters cannot be tagged as start
            s += 1
            char_pred[s: e, 1], char_pred[s: e, 2] = (
                np.max(char_pred[s: e, 1:], 1),
                np.min(char_pred[s: e, 1:], 1),
            )

    return char_pred


def labels_to_sub(labels):
    all_spans = []
    for label in labels:
        indices = np.where(label > 0)[0]
        indices_grouped = [
            list(g) for _, g in itertools.groupby(
                indices, key=lambda n, c=itertools.count(): n - next(c)
            )
        ]

        spans = [f"{min(r)} {max(r) + 1}" for r in indices_grouped]
        all_spans.append(";".join(spans))
    return all_spans


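def char_target_to_span(char_target):
    # Convert a 0/1 character mask into a list of [start, end) spans
    spans = []
    start, end = 0, 0
    for i in range(len(char_target)):
        # i > 0 avoids wrapping around to the last element when i == 0
        if char_target[i] == 1 and i > 0 and char_target[i - 1] == 0:
            if end:
                spans.append([start, end])
            start = i
            end = i + 1
        elif char_target[i] == 1:
            end = i + 1
        else:
            if end:
                spans.append([start, end])
            start, end = 0, 0
    # Close a span that runs up to the last character
    if end:
        spans.append([start, end])
    return spans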


import numpy as np
from transformers import AutoTokenizer


def get_tokenizer(name, precompute=False, df=None, folder=None):
    if folder is None:
        tokenizer = AutoTokenizer.from_pretrained(name)
    else:
        tokenizer = AutoTokenizer.from_pretrained(folder)

    tokenizer.name = name
    tokenizer.special_tokens = {
        "sep": tokenizer.sep_token_id,
        "cls": tokenizer.cls_token_id,
        "pad": tokenizer.pad_token_id,
    }

    if precompute:
        tokenizer.precomputed = precompute_tokens(df, tokenizer)
    else:
        tokenizer.precomputed = None

    return tokenizer


def precompute_tokens(df, tokenizer):
    feature_texts = df["feature_text"].unique()

    ids = {}
    offsets = {}

    for feature_text in feature_texts:
        encoding = tokenizer(
            feature_text,
            return_token_type_ids=True,
            return_offsets_mapping=True,
            return_attention_mask=False,
            add_special_tokens=False,
        )
        ids[feature_text] = encoding["input_ids"]
        offsets[feature_text] = encoding["offset_mapping"]

    texts = df["clean_text"].unique()

    for text in texts:
        encoding = tokenizer(
            text,
            return_token_type_ids=True,
            return_offsets_mapping=True,
            return_attention_mask=False,
            add_special_tokens=False,
        )
        ids[text] = encoding["input_ids"]
        offsets[text] = encoding["offset_mapping"]

    return {"ids": ids, "offsets": offsets}


def encodings_from_precomputed(feature_text, text, precomputed, tokenizer, max_len=300):
    tokens = tokenizer.special_tokens

    # Input ids
    if "roberta" in tokenizer.name:
        qa_sep = [tokens["sep"], tokens["sep"]]
    else:
        qa_sep = [tokens["sep"]]

    input_ids = [tokens["cls"]] + precomputed["ids"][feature_text] + qa_sep
    n_question_tokens = len(input_ids)

    input_ids += precomputed["ids"][text]
    input_ids = input_ids[: max_len - 1] + [tokens["sep"]]

    # Token type ids
    if "roberta" not in tokenizer.name:
        token_type_ids = np.ones(len(input_ids))
        token_type_ids[:n_question_tokens] = 0
        token_type_ids = token_type_ids.tolist()
    else:
        token_type_ids = [0] * len(input_ids)

    # Offsets
    offsets = [(0, 0)] * n_question_tokens + precomputed["offsets"][text]
    offsets = offsets[: max_len - 1] + [(0, 0)]

    # Padding
    padding_length = max_len - len(input_ids)
    if padding_length > 0:
        input_ids = input_ids + ([tokens["pad"]] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)
        offsets = offsets + ([(0, 0)] * padding_length)

    encoding = {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "offset_mapping": offsets,
    }

    return encoding


import torch
import numpy as np
from torch.utils.data import Dataset


class PatientNoteDataset(Dataset):
    def __init__(self, df, tokenizer, max_len):
        self.df = df
        self.max_len = max_len
        self.tokenizer = tokenizer

        self.texts = df['clean_text'].values
        self.feature_text = df['feature_text'].values
        self.char_targets = df['target'].values.tolist()

    def __getitem__(self, idx):
        text = self.texts[idx]
        feature_text = self.feature_text[idx]
        char_target = self.char_targets[idx]

        # Tokenize
        if self.tokenizer.precomputed is None:
            encoding = self.tokenizer(
                feature_text,
                text,
                return_token_type_ids=True,
                return_offsets_mapping=True,
                return_attention_mask=False,
                truncation="only_second",
                max_length=self.max_len,
                padding='max_length',
            )
            raise NotImplementedError("fix issues with question offsets")
        else:
            encoding = encodings_from_precomputed(
                feature_text,
                text,
                self.tokenizer.precomputed,
                self.tokenizer,
                max_len=self.max_len
            )

        return {
            "ids": torch.tensor(encoding["input_ids"], dtype=torch.long),
            "token_type_ids": torch.tensor(encoding["token_type_ids"], dtype=torch.long),
            "target": torch.tensor([0], dtype=torch.float),
            "offsets": np.array(encoding["offset_mapping"]),
            "text": text,
        }

    def __len__(self):
        return len(self.texts)

    
import spacy
import numpy as np

def plot_annotation(df, pn_num):
    options = {"colors": {}}

    df_text = df[df["pn_num"] == pn_num].reset_index(drop=True)

    text = df_text["pn_history"][0]
    ents = []

    for spans, feature_text, feature_num in df_text[["span", "feature_text", "feature_num"]].values:
        for s in spans:
            ents.append({"start": int(s[0]), "end": int(s[1]), "label": feature_text})

        options["colors"][feature_text] =  f"rgb{tuple(np.random.randint(100, 255, size=3))}"

    doc = {"text": text, "ents": sorted(ents, key=lambda i: i["start"])}

    spacy.displacy.render(doc, style="ent", options=options, manual=True, jupyter=True)

In [None]:
import torch
import transformers
import torch.nn as nn
from transformers import AutoConfig, AutoModel


class NERTransformer(nn.Module):
    def __init__(
        self,
        model,
        num_classes=1,
        config_file=None,
        pretrained=True,
    ):
        super().__init__()
        self.name = model
        self.pad_idx = 1 if "roberta" in self.name else 0

        transformers.logging.set_verbosity_error()

        if config_file is None:
            config = AutoConfig.from_pretrained(model, output_hidden_states=True)
        else:
            config = torch.load(config_file)

        if pretrained:
            self.transformer = AutoModel.from_pretrained(model, config=config)
        else:
            self.transformer = AutoModel.from_config(config)

        self.nb_features = config.hidden_size

#         self.cnn = nn.Identity()
        self.logits = nn.Linear(self.nb_features, num_classes)

    def forward(self, tokens, token_type_ids):
        """
        Usual torch forward function

        Arguments:
            tokens {torch tensor} -- Input token ids
            token_type_ids {torch tensor} -- Segment (token type) ids
        """
        hidden_states = self.transformer(
            tokens,
            attention_mask=(tokens != self.pad_idx).long(),
            token_type_ids=token_type_ids,
        )[-1]

        features = hidden_states[-1]

        logits = self.logits(features)

        return logits
    
    
import torch

def load_model_weights(model, filename, verbose=1, cp_folder="", strict=True):
    """
    Loads the weights of a PyTorch model. The exception handles cpu/gpu incompatibilities.

    Args:
        model (torch model): Model to load the weights to.
        filename (str): Name of the checkpoint.
        verbose (int, optional): Whether to display infos. Defaults to 1.
        cp_folder (str, optional): Folder to load from. Defaults to "".
        strict (bool, optional): Whether to require an exact match of the keys. Defaults to True.

    Returns:
        torch model: Model with loaded weights.
    """
    if verbose:
        print(f"\n -> Loading weights from {os.path.join(cp_folder,filename)}\n")

    try:
        model.load_state_dict(
            torch.load(os.path.join(cp_folder, filename), map_location="cpu"),
            strict=strict,
        )
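    except RuntimeError:
        # Note: NERTransformer defines `logits` and `nb_features`, not `encoder.fc` or `nb_ft`,
        # so this fallback raises AttributeError when used with that model
        model.encoder.fc = torch.nn.Linear(model.nb_ft, 1)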
        model.load_state_dict(
            torch.load(os.path.join(cp_folder, filename), map_location="cpu"),
            strict=strict,
        )

    return model

In [None]:
import torch
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm


def predict(model, dataset, data_config, activation="softmax"):
    """
    Usual predict torch function
    """
    model.eval()

    loader = DataLoader(
        dataset,
        batch_size=data_config['val_bs'],
        shuffle=False,
        num_workers=NUM_WORKERS,
        pin_memory=True,
    )

    preds = []
    with torch.no_grad():
        for data in tqdm(loader):
            ids, token_type_ids = data["ids"], data["token_type_ids"]

            y_pred = model(ids.cuda(), token_type_ids.cuda())

            if activation == "sigmoid":
                y_pred = y_pred.sigmoid()
            elif activation == "softmax":
                y_pred = y_pred.softmax(-1)

            preds += [
                token_pred_to_char_pred(y, offsets) for y, offsets
                in zip(y_pred.detach().cpu().numpy(), data["offsets"].numpy())
            ]

    return preds


def inference_test(df, exp_folder, config, cfg_folder=None):
    preds = []

    if cfg_folder is not None:
        model_config_file = cfg_folder + config.name.split('/')[-1] + "/config.pth"
        tokenizer_folder = cfg_folder + config.name.split('/')[-1] + "/tokenizers/"
    else:
        model_config_file, tokenizer_folder = None, None

    tokenizer = get_tokenizer(
        config.name, precompute=config.precompute_tokens, df=df, folder=tokenizer_folder
    )

    dataset = PatientNoteDataset(
        df,
        tokenizer,
        max_len=config.max_len,
    )

    model = NERTransformer(
        config.name,
        num_classes=config.num_classes,
        config_file=model_config_file,
        pretrained=False
    ).cuda()
    model.zero_grad()

    weights = sorted(glob.glob(exp_folder + "*.pt"))
    for weight in weights:
        model = load_model_weights(model, weight)

        pred = predict(
            model,
            dataset,
            data_config=config.data_config,
            activation=config.loss_config["activation"]
        )
        preds.append(pred)

    return preds
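
`predict` relies on `token_pred_to_char_pred` to turn one probability per token into one probability per character. The sketch below shows the same idea in plain Python for the single-class case. It is an illustration only: `spread_to_chars` and `text_len` are hypothetical names, and the offsets are written by hand rather than produced by the tokenizer.

```python
def spread_to_chars(token_probs, offsets, text_len):
    """Copy each token probability onto the characters the token covers."""
    char_probs = [0.0] * text_len
    for prob, (start, end) in zip(token_probs, offsets):
        # Special and padding tokens use the offset (0, 0) and cover nothing
        for i in range(start, min(end, text_len)):
            char_probs[i] = prob
    return char_probs


text = "no chest pain"
offsets = [(0, 0), (0, 2), (3, 8), (9, 13), (0, 0)]  # hand-written example offsets
token_probs = [0.0, 0.1, 0.9, 0.8, 0.0]

char_probs = spread_to_chars(token_probs, offsets, len(text))
print([i for i, p in enumerate(char_probs) if p > 0.5])  # [3, 4, 5, 6, 7, 9, 10, 11, 12]
```

The space at position 8 stays at 0 here, which is the kind of gap that `post_process_spaces` fills when both neighbouring characters are positive.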

In [None]:
class Config:
    # Architecture
    name = "roberta-large"
    num_classes = 1

    # Texts
    max_len = 310
    precompute_tokens = True

    # Training    
    loss_config = {
        "activation": "sigmoid",
    }

    data_config = {
        "val_bs": 16 if "large" in name else 32,
        "pad_token": 1 if "roberta" in name else 0,
    }

    verbose = 1

In [None]:
def post_process_spaces(target, text):
    target = np.copy(target)

    if len(text) > len(target):
        padding = np.zeros(len(text) - len(target))
        target = np.concatenate([target, padding])
    else:
        target = target[:len(text)]

    if text[0] == " ":
        target[0] = 0
    if text[-1] == " ":
        target[-1] = 0

    for i in range(1, len(text) - 1):
        if text[i] == " ":
            if target[i] and not target[i - 1]:  # space before
                target[i] = 0

            if target[i] and not target[i + 1]:  # space after
                target[i] = 0

            if target[i - 1] and target[i + 1]:
                target[i] = 1

    return target

In [None]:
"""
!pip install GPUtil

import torch
from GPUtil import showUtilization as gpu_usage
from numba import cuda

def free_gpu_cache():
    print("Initial GPU Usage")
    gpu_usage()                             

    torch.cuda.empty_cache()

    cuda.select_device(0)
    cuda.close()
    cuda.select_device(0)

    print("GPU Usage after emptying the cache")
    gpu_usage()

free_gpu_cache()   
"""

In [None]:
import os
"""
import os
print(os.chdir("/kaggle/input/gputil/GPUtil"))
print(os.getcwd())
print(os.listdir("/kaggle/input/gputil/GPUtil"))
"""

In [None]:
#!pip install GPUtil
os.chdir("/kaggle/input/gputil/GPUtil")
import GPUtil

import torch
torch.cuda.empty_cache()

from GPUtil import showUtilization as gpu_usage
from numba import cuda

def free_gpu_cache():
    print("Initial GPU Usage")
    gpu_usage()                             

    torch.cuda.empty_cache()

    cuda.select_device(0)
    cuda.close()
    cuda.select_device(0)

    print("GPU Usage after emptying the cache")
    gpu_usage()

free_gpu_cache()

os.chdir(firstdir)

In [None]:
df_test = load_and_prepare_test(root=DATA_PATH)
display(df_test.head())

preds = inference_test(
    df_test,
    WEIGHTS_FOLDER,
    Config,
    cfg_folder=OUT_PATH
)[0]

df_test['preds'] = preds
df_test['preds'] = df_test.apply(lambda x: x['preds'][:len(x['clean_text'])], 1)

df_test['preds'] = df_test['preds'].apply(lambda x: (x > 0.5).flatten())

df_test['preds_pp'] = df_test.apply(lambda x: post_process_spaces(x['preds'], x['clean_text']), 1)

df_test['location'] = labels_to_sub(df_test['preds_pp'].values)

df_test[["id","location"]].head(20)
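
The last step of the cell above converts a per-character 0/1 mask into the `start end;start end` location string. The sketch below reproduces the `itertools.groupby` idea used by `labels_to_sub` without NumPy, as an illustration under the assumption that the mask is a plain list; `mask_to_location` is a hypothetical name.

```python
from itertools import groupby


def mask_to_location(mask):
    """Turn a 0/1 character mask into a 'start end;start end' string
    (end is exclusive, as in labels_to_sub)."""
    indices = [i for i, flag in enumerate(mask) if flag]
    spans = []
    # Consecutive indices share the same value of (index - position in the list)
    for _, run in groupby(enumerate(indices), key=lambda pair: pair[1] - pair[0]):
        run = [idx for _, idx in run]
        spans.append(f"{run[0]} {run[-1] + 1}")
    return ";".join(spans)


print(mask_to_location([0, 1, 1, 1, 0, 0, 1, 1]))  # 1 4;6 8
```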

In [None]:
def roberta_prediction(x):
    df = x.copy()
    patient_notes = pd.read_csv(root + "patient_notes.csv")
    features = pd.read_csv(root + "features.csv")
    df = df.merge(features, how="left", on=["case_num", "feature_num"])
    df = df.merge(patient_notes, how="left", on=['case_num', 'pn_num'])
    df['pn_history'] = df['pn_history'].apply(lambda x: x.strip())
    df['feature_text'] = df['feature_text'].apply(process_feature_text)
    df['feature_text'] = df['feature_text'].apply(clean_spaces)
    df['clean_text'] = df['pn_history'].apply(clean_spaces)
    df['target'] = ""
    preds = inference_test(
        df,
        WEIGHTS_FOLDER,
        Config,
        cfg_folder=OUT_PATH
    )[0]
    df['preds'] = preds
    df['preds'] = df.apply(lambda x: x['preds'][:len(x['clean_text'])], 1)
    df['preds'] = df['preds'].apply(lambda x: (x > 0.5).flatten())
    df['preds_pp'] = df.apply(lambda x: post_process_spaces(x['preds'], x['clean_text']), 1)
    df['location'] = labels_to_sub(df['preds_pp'].values)
    return df['location'].to_list()

In [None]:
"""
dftest = []
dftest.append({"id":"00016_000","case_num":0,"pn_num":16,"feature_num":0})
dftest = pd.DataFrame(dftest)
location = roberta_prediction(dftest)[0]
location
"""

In [None]:
#!pip install GPUtil --target=/kaggle/working/GPUtil

# **Final solution**

Next, the final solution proposed by the author is developed.

It is divided into the following steps:

1) Convert the entire clinical note to lowercase

2) Identify the category of each feature

3) If it is category 0, perform the search by tags

4) If it is category 1, perform the search with the female list

5) If it is category 2, perform the search with the male list

6) If it is category 3, generate the list of tags based on age, then perform the search with that list
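
The steps above can be sketched as a small dispatch function. This is a minimal illustration under stated assumptions, not the notebook's code: the tag lists, `age_tags`, `find_tag` and `search_feature` are hypothetical placeholders, and whole-word regular expression matching is an assumption about how a tag is located in the note.

```python
import re

# Hypothetical tag lists, for illustration only
FEATURE_TAGS = {0: ["chest pain", "cp"]}
FEMALE_TAGS = ["female", "woman"]
MALE_TAGS = ["male", "man"]


def age_tags(age):
    """Hypothetical age variants for category 3."""
    return [f"{age} yo", f"{age} y/o", f"{age} year old"]


def find_tag(tag, note):
    """Return 'start end' for every whole-word occurrence of tag in note."""
    pattern = r"\b" + re.escape(tag) + r"\b"
    return [f"{m.start()} {m.end()}" for m in re.finditer(pattern, note)]


def search_feature(note, category, feature_num=None, age=None):
    note = note.lower()                       # step 1: lowercase the note
    if category == 0:                         # feature-specific tags
        tags = FEATURE_TAGS.get(feature_num, [])
    elif category == 1:                       # female list
        tags = FEMALE_TAGS
    elif category == 2:                       # male list
        tags = MALE_TAGS
    elif category == 3:                       # tags generated from the age
        tags = age_tags(age)
    else:
        raise ValueError(f"unknown category: {category}")
    spans = []
    for tag in tags:
        spans.extend(find_tag(tag, note))
    return ";".join(spans)


note = "17 yo Male with Chest Pain for 2 days"
print(search_feature(note, 3, age=17))         # 0 5
print(search_feature(note, 2))                 # 6 10
print(search_feature(note, 0, feature_num=0))  # 16 26
```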

The test dataset below is used to evaluate the solution.

In [None]:
df_test = pd.read_csv(root+"test.csv")
df_test.head()

In [None]:
mask_0 = df_features["feature_cat"] == 0
df_features_0 = df_features[mask_0].copy()
list_0 = df_features_0["feature_num"].tolist()

mask_1 = df_features["feature_cat"] == 1
df_features_1 = df_features[mask_1].copy()
list_1 = df_features_1["feature_num"].tolist()

mask_2 = df_features["feature_cat"] == 2
df_features_2 = df_features[mask_2].copy()
list_2 = df_features_2["feature_num"].tolist()

mask_3 = df_features["feature_cat"] == 3
df_features_3 = df_features[mask_3].copy()
list_3 = df_features_3["feature_num"].tolist()

In [None]:
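def corregir_locaciones_irrelevantes(locations):
    # Merge duplicated, overlapping and adjacent spans of a "start end;start end" string
    locations_numbers = []
    split_locs = locations.split(";")
    for loc in split_locs:
        numbers = loc.split(" ")
        left = int(numbers[0])
        right = int(numbers[1])
        # Expand each span into its individual positions (both ends included)
        locations_numbers += np.arange(left,right+1).tolist()
    # Drop duplicated positions and sort them
    locations_numbers = list(dict.fromkeys(locations_numbers))
    locations_numbers.sort()
    result = ""
    for n in locations_numbers:
        # Open a new span at the first position or right after a closed span
        if (result == "") or (result[-1] == ";"):
            result += str(n)
        # Close the span when the next position is missing
        if (n+1) not in locations_numbers:
            result += " "+str(n)+";"
    if result[-1] == ";":
        result = result[:-1]
    return result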
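
As a worked example of what this clean-up achieves, the sketch below produces the same kind of output by merging intervals directly instead of expanding every position. It is an alternative illustration, not the function used in the notebook: `merge_location_string` is a hypothetical name, and spans whose start is at most one past the previous end are merged, mirroring the inclusive expansion above.

```python
def merge_location_string(locations):
    """Merge duplicated, overlapping and adjacent spans of a
    'start end;start end' string."""
    spans = sorted(tuple(map(int, loc.split())) for loc in locations.split(";"))
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the current span
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return ";".join(f"{s} {e}" for s, e in merged)


print(merge_location_string("10 20;15 25;40 45;10 20"))  # 10 25;40 45
```

Sorting the spans once keeps the cost proportional to the number of spans rather than to the number of characters they cover.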

In [None]:
import warnings
warnings.filterwarnings("ignore")
%matplotlib widget

In [None]:
def delete_dupicated_spaces(text):
    return re.sub(' +', ' ',text)

def count_split(text,delimiter=" "):
    return len(text.split(delimiter))

In [None]:
# Random integers are used later to decide whether to call the RoBERTa fallback
from random import seed
from random import randint
# Generate one integer in [0, 10] as a quick check
# (no seed is set, so the value changes between runs)
value = randint(0, 10)
print(value)

In [None]:
def busqueda_caracteristicas(df):
  count_rows = df.shape[0]
  i = 0
  predictions = []
  for index, row in df.iterrows():
    if i%100 == 0:
        print("Progress: "+str(round((i/count_rows)*100,2)))
    mask = df_patient_notes["pn_num"] == row["pn_num"]
    df_masked = df_patient_notes[mask].copy()
    history = df_masked.iloc[0]["pn_history"]
    history = history.lower()
    locations = [] 
    if row["feature_num"] in list_0:                              #CATEGORY 0#
      for tag in dictionary_features[row["feature_num"]]:         
        locs = obtnener_coordenadas_tag(tag,history)
        if (tag in history_artificial_tags) and (locs != None):
            tag_cleaned = delete_dupicated_spaces(tag)
            if (count_split(tag_cleaned," ") <= 2) or (count_split(tag_cleaned,"-") <= 5):
                tokens = nltk.sent_tokenize(history)
                for t in tokens:
                    tokens2 = t.splitlines()
                    for t2 in tokens2:
                        if tag in t2:
                            locs_ = obtnener_coordenadas_tag(t2,history)
                            if locs_ != None:
                                locations.append(locs_)
                        
        if locs != None:
          locations.append(locs)
    elif row["feature_num"] in list_1:                           #CATEGORY 1#
      for tag in general_female_tags:
        if tag == " f ":
          locs = obtnener_coordenadas_tag(tag,history,1,-1)
        else:
          locs = obtnener_coordenadas_tag(tag,history)
        if locs != None:
          locations.append(locs)

    elif row["feature_num"] in list_2:                          #CATEGORY 2#
      for tag in general_male_tags:
        if tag == " m ":
          locs = obtnener_coordenadas_tag(tag,history,1,-1)
        else:
          locs = obtnener_coordenadas_tag(tag,history)
        if locs != None:
          locations.append(locs)

    elif row["feature_num"] in list_3:                           #CATEGORY 3#
      num = int(re.findall(r'\b\d+\b', df_features[df_features["feature_num"] == row["feature_num"]].iloc[0,2])[0])
      list_tags = generate_num_year_tags(num)
      for tag in list_tags:
        locs = obtnener_coordenadas_tag(tag,history)
        if locs != None:
          locations.append(locs)

    if (len(locations) == 0) and (row["feature_num"] in list_0):
      # No tag matched: fall back to the RoBERTa model for a random subset of rows
      if randint(0, 10) % 2 == 0:
          row_1 = pd.DataFrame([{"id":row["id"],"case_num":row["case_num"],"pn_num":row["pn_num"],"feature_num":row["feature_num"]}])
          listlocation = roberta_prediction(row_1)
          locations = ";".join(listlocation)
          #row_1_processed = process_data(row_1)
          #row_1_prediction = albert_predict(row_1_processed)
          #locations = row_1_prediction[0]["location"]
          if locations == "":
            locations = np.nan
      else:
          # Fallback skipped: store NaN instead of leaving an empty list in the submission
          locations = np.nan
    elif len(locations) == 0:
      locations = np.nan
    else:
      locations = ";".join(locations)
      locations = corregir_locaciones_irrelevantes(locations)
    predictions.append({"id":row["id"],"location":locations})
    i += 1
  return pd.DataFrame(predictions)

In [None]:
df_preds = busqueda_caracteristicas(df_test)
df_preds.head()

In [None]:
# Write the submission to the Kaggle working directory
df_preds.to_csv("/kaggle/working/submission.csv",index=False)