<a href="https://colab.research.google.com/github/saraxmartin/NegExES/blob/main/Rule_based.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RULE-BASED APPROACH
Student 1: Sara Martín (NIU:1669812)

Student 2: Amelia Gomez (NIU:1631745)

Student 3: Aina Navarro (NIU:1670797)

Student 4: Lara Rodríguez (NIU:1667906)

## Functionalities

In [1]:
!git clone https://github.com/saraxmartin/NegExES.git

Cloning into 'NegExES'...
remote: Enumerating objects: 41, done.
remote: Counting objects: 100% (41/41), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 41 (delta 19), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (41/41), 1.92 MiB | 3.62 MiB/s, done.
Resolving deltas: 100% (19/19), done.


In [None]:
!pip install -U spacy pyspellchecker
!python -m spacy download es_core_news_sm

In [4]:
import os
import spacy
from spellchecker import SpellChecker
# Load the Spanish language model from spaCy
nlp = spacy.load("es_core_news_sm")

In [None]:
def correct_mispellings(words):
    """
    Correct spelling mistakes for each word (or phrase) in a list
    """
    # Initialize a spell checker for Spanish
    spell = SpellChecker(language="es")
    # Correct the spelling mistakes in each word
    corrected_words = []
    for word in words:
        # Tokenize the word using spaCy
        doc = nlp(word)
        corrected_word = ""
        for token in doc:
            # Keep non-alphabetic tokens and words the dictionary already knows
            if not token.is_alpha or token.text.lower() in spell:
                corrected_word += token.text
            else:
                # correction() can return None when no candidate is found,
                # so fall back to the original token in that case
                corrected_word += spell.correction(token.text.lower()) or token.text
            # Restore the whitespace that followed the token in the input
            corrected_word += token.whitespace_
        corrected_words.append(corrected_word)
    return corrected_words


# Data pre-processing


#### Create datasets for medical terminology

In [8]:
import xml.etree.ElementTree as ET

medicalterms_es = set()
medicalterms_cat = set()

directory = '/content/NegExES/data/termcat_terminologies'

for filename in os.listdir(directory):
    if filename.endswith('.xml'):
        # Construct the full path to the XML file
        file_path = os.path.join(directory, filename)

        # Parse the XML file
        tree = ET.parse(file_path)
        root = tree.getroot()

        # Iterate over each 'fitxa' element
        for fitxa in root.findall('.//fitxa'):
            # Iterate over each 'denominacio' element within the 'fitxa'
            for denominacio in fitxa.findall('denominacio'):
                if denominacio.get('llengua') == 'es':
                    medicalterms_es.add(denominacio.text)
                elif denominacio.get('llengua') == 'ca':
                    medicalterms_cat.add(denominacio.text)

medicalterms_es = list(medicalterms_es)
medicalterms_cat = list(medicalterms_cat)
print(len(medicalterms_es))
print(len(medicalterms_cat))
print(medicalterms_es[:20])
print(medicalterms_cat[:20])

17052
17394
['conducto semicircular posterior', 'rinoscopia posterior', 'vacuna antineumocócica conjugada 13-valente', 'sudoresis', 'astigmatismo contra la regla', 'índice de Barthel', 'autoacusación', 'abrebocas', 'dilución', 'cymba conchalis', 'masoterapia', 'contracción paradójica', 'plexo nervioso', 'subículo', 'radiodenso', 'flujo', 'VRE', 'neuritis óptica', 'isquion', 'idea']
['múscul constrictor superior de la faringe', 'oftàlmia', 'grànul primari', 'isquèmia retinal', 'siringomièlia', 'fibra arcuada externa', 'limbe de la fossa oval', 'deliri secundari', 'artèria capsular mitjana', 'tromofília', 'anquilosi fibrosa', 'prova unilateral', 'asèpsia', 'iatrogènic -a', 'lligaments de Cooper', 'VRE', 'degeneració hepatolenticular', 'intumescència cervical', 'idea', 'prepart']


#### Import the data (JSON)


In [None]:
import json

train_data = "/content/NegExES/data/negacio_train_v2024.json"

# Read as UTF-8 so accented characters load the same way on every platform
with open(train_data, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Extract text and rules from predictions
corpus = []
rules = []
neg, unc, umls = [],[],[]
neg_scopes, unc_scopes = [],[]

for item in data:
    text = item['data']['text']
    corpus.append(text)
    predictions = item['predictions']
    for prediction in predictions:
        for result in prediction['result']:
            value = result['value']
            start = value['start']
            end = value['end']
            labels = value['labels']
            extracted_text = text[start:end]
            for label in labels:
                if (label == "NEG") and (extracted_text not in neg):
                    neg.append(extracted_text)
                elif (label == "UNC") and (extracted_text not in unc):
                    unc.append(extracted_text)
                elif label == "NSCO":
                    if extracted_text not in umls:
                        umls.append(extracted_text)
                    neg_scopes.append((start,end))
                elif label == "USCO":
                    if extracted_text not in umls:
                        umls.append(extracted_text)
                    unc_scopes.append((start,end))

print(len(neg))
print(len(unc))
print(len(umls))

98
84
2955


### Preprocess Rule Tags

In [None]:
# Add rule tags we found to the ones we have from the training dataset

with open('/content/NegExES/data/extra_rules.txt', 'r') as file:
    file_extra_rules = file.read()

extra_rules = []

# Each line holds a phrase, a separator character and the rule tag in square brackets
for line in file_extra_rules.splitlines():
    if not line.strip():
        continue
    # Split at the first "[": phrase on the left, tag (plus closing "]") on the right
    word, _, tag = line.partition("[")
    extra_rules.append((word.strip(), tag.strip().rstrip("]")))

for rule in extra_rules:
    # `and` binds tighter than `or`, so test tag membership first and then check duplicates
    if rule[1] in ("PREN", "POSTN") and rule[0] not in neg:
        neg.append(rule[0])
    elif rule[1] in ("PREP", "POSTP") and rule[0] not in unc:
        unc.append(rule[0])

print(len(neg))
print(len(unc))


[('ausencia de', 'PREN'), ('no pueden ver', 'PREN'), ('no poder', 'PREN'), ('revisado para', 'PREN'), ('rechazado', 'PREN'), ('declina', 'PREN'), ('negado', 'PREN'), ('niega', 'PREN'), ('negando', 'PREN'), ('evaluar por', 'PREN'), ('no revela', 'PREN'), ('libre de', 'PREN'), ('negativo para', 'PREN'), ('nunca desarrollado', 'PREN'), ('nunca tuve', 'PREN'), ('no', 'PREN'), ('no anormal', 'PREN'), ('ninguna causa de', 'PREN'), ('sin quejas de', 'PREN'), ('sin evidencia', 'PREN'), ('ninguna nueva evidencia', 'PREN'), ('ninguna otra evidencia', 'PREN'), ('ninguna evidencia para sugerir', 'PREN'), ('sin hallazgos de', 'PREN'), ('no hay hallazgos para indicar', 'PREN'), ('no hay evidencia mamográfica de', 'PREN'), ('nada nuevo', 'PREN'), ('ninguna evidencia radiográfica de', 'PREN'), ('ninguna señal de', 'PREN'), ('no significativo', 'PREN'), ('sin signos de', 'PREN'), ('ninguna sugerencia de', 'PREN'), ('no sospechoso', 'PREN'), ('no', 'PREN'), ('no aparece', 'PREN'), ('no apreciar', 'PREN'

In [None]:
# Add extra medical terms
medical_keywords = ['resultado','efecto','reacción','prueba','respuesta','diagnóstico','presencia','hallazgo','función','riesgo','síntoma','indicación','tratamiento','terapia',
                    'análisis','complicación','enfermedad','condición','sensibilidad','exposición','concentración','infección','detección','alteración','nivel','signo','deficiencia',
                    'intolerancia','inmunidad','resistencia','capacidad','absorción','secuela','progresión','mejora','rechazo','eficacia','toxicidad','prevención']

umls.extend(medical_keywords)

In [None]:
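# Clean rules - strip surrounding whitespace and punctuation, then remove duplicates
def clean_words(words):
    # Strip whitespace and the listed punctuation marks from both ends of each entry
    words = [word.strip(" \t\n!?,.;:") for word in words]
    #words = correct_mispellings(words)
    # Remove duplicates and drop entries that became empty after stripping
    words = list({word for word in words if word})
    return words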

neg = clean_words(neg)
unc = clean_words(unc)
umls = clean_words(umls)

### Preprocess Corpus

#### Correct misspelled words or lemmatise them

#### Keep or remove punctuation (depending on algorithm)

#### Split text into individual sentences (depending on algorithm)
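
One possible way to split a report into sentences while remembering where each sentence sits in the original text is sketched below. It is a naive regular-expression splitter and an assumption, not the notebook's chosen method; the name `split_sentences` is hypothetical, and abbreviations or decimal numbers such as "3.5" would be split incorrectly.

```python
import re

def split_sentences(text):
    """Return (sentence, start, end) triples with offsets into the original text."""
    spans = []
    # A sentence is a run of non-terminator characters plus its closing punctuation
    for match in re.finditer(r"[^.!?\n]+[.!?]*", text):
        raw = match.group(0)
        stripped = raw.strip()
        if stripped:
            # Shift the start past any leading whitespace that was stripped
            start = match.start() + (len(raw) - len(raw.lstrip()))
            spans.append((stripped, start, start + len(stripped)))
    return spans

for sentence, start, end in split_sentences("No fiebre. Niega dolor."):
    print(start, end, sentence)
```

Keeping the offsets means any cue or scope found inside a sentence can be mapped back to the coordinates used by the annotations.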

#### Remove patient information at the start to avoid false positives (?)

#### Tokenize (keep coordinates of original text)
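
A minimal sketch of tokenization that keeps the coordinates of the original text, assuming a simple regular-expression tokenizer is acceptable here. The function name `tokenize_with_offsets` is hypothetical; each token comes back with its start and end character positions so that predictions can be compared with the `(start, end)` pairs in the annotations.

```python
import re

def tokenize_with_offsets(text):
    """Return (token, start, end) triples such that text[start:end] == token."""
    # Words are runs of word characters; every other non-space character is its own token
    return [(m.group(0), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

tokens = tokenize_with_offsets("No presenta fiebre.")
print(tokens)
```

spaCy tokens also expose their start offset as `token.idx`, so the same information is available from the `nlp` pipeline if spaCy tokenization is preferred.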

# NegEx algorithm

- input = sentence with indexed diseases/findings
- output = sentence with indexed negated diseases/findings

PROCESS

- One sentence per line.
- Remove all punctuation, but keep stop words.
- Index diseases/findings by replacing phrases in the text with their unique string identifiers (IDs) from UMLS. If a sentence contains no UMLS term, it is not searched for negations.

Ex: "*The patient denied experiencing chest pain on exertion*" --> "*The patient denied experiencing S1459038 on exertion*"

- The string-matching algorithm selects the longest possible string among the eligible UMLS matches. Ex: it will match "*nonspecific viral rash*" instead of just "*rash*".
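
The indexing step with longest-match preference can be sketched as follows. This is an illustration under stated assumptions, not the NegEx code: the function `index_terms` and the sequential `T0000`-style identifiers are hypothetical stand-ins for real UMLS IDs, and the term list is passed in directly.

```python
import re

def index_terms(sentence, terms):
    """Replace each known term with a placeholder ID, preferring the longest match."""
    unique = sorted(set(t.lower() for t in terms if t))
    if not unique:
        return sentence, []
    # Hypothetical sequential IDs; a real system would look up UMLS identifiers
    ids = {term: f"T{i:04d}" for i, term in enumerate(unique)}
    # Longest terms first, so "nonspecific viral rash" is tried before "rash"
    ordered = sorted(unique, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    found = []

    def replace(match):
        term_id = ids[match.group(0).lower()]
        # Offsets refer to the original sentence, not the rewritten one
        found.append((term_id, match.start(), match.end()))
        return term_id

    return pattern.sub(replace, sentence), found

indexed, found = index_terms(
    "The patient has a nonspecific viral rash today",
    ["rash", "nonspecific viral rash"],
)
print(indexed)
print(found)
```

Sorting the alternatives by length works because Python's regex alternation takes the first alternative that matches at a given position.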

RESULTS
- The NegEx authors identified 35 negation phrases, which fall into two groups:
  1. Pseudo-negation phrases: they appear to indicate negation but instead mark double negatives ("not ruled out"), modified meanings ("gram-negative"), or ambiguous phrasing ("unremarkable").
  2. Phrases that negate diseases/findings when used in one of these regular expressions:
     - (*negation phrase*) \* (**UMLS term**)
     - (**UMLS term**) \* (*negation phrase*)
     - "\*" = up to 5 tokens may fall between the negation phrase and the UMLS term
     - Ex: "extremities showed *no* **cyanosis**, clubbing, or **edema**" matches (no) \* (UMLS term) twice, once for cyanosis and once for edema.
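
The two patterns above can be sketched at token level as shown here. This is a simplified illustration, not the reference NegEx implementation: the names `negated_terms`, `find_phrase`, and `tokenize` are hypothetical, every negation phrase is tried in both directions (an assumption of this sketch), and terms are matched as plain token sequences rather than UMLS IDs.

```python
import re

def tokenize(text):
    # Lowercase word tokens; punctuation is dropped, as in the PROCESS notes
    return re.findall(r"\w+", text.lower())

def find_phrase(tokens, phrase_tokens):
    """Return the start indices at which phrase_tokens occurs in tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return []
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens]

def negated_terms(sentence, negation_phrases, terms, pseudo_phrases=(), window=5):
    tokens = tokenize(sentence)
    # Tokens covered by a pseudo-negation phrase must not trigger a negation
    blocked = set()
    for phrase in pseudo_phrases:
        p = tokenize(phrase)
        for i in find_phrase(tokens, p):
            blocked.update(range(i, i + len(p)))
    negated = set()
    for phrase in negation_phrases:
        p = tokenize(phrase)
        for neg_start in find_phrase(tokens, p):
            neg_end = neg_start + len(p)
            if blocked & set(range(neg_start, neg_end)):
                continue
            for term in terms:
                t = tokenize(term)
                for term_start in find_phrase(tokens, t):
                    term_end = term_start + len(t)
                    gap_after = term_start - neg_end   # (negation phrase) * (term)
                    gap_before = neg_start - term_end  # (term) * (negation phrase)
                    if 0 <= gap_after <= window or 0 <= gap_before <= window:
                        negated.add(term)
    return sorted(negated)

print(negated_terms("extremities showed no cyanosis, clubbing, or edema",
                    ["no"], ["cyanosis", "edema"]))
```

Passing `pseudo_phrases=["not ruled out"]` prevents the "not" in "infection was not ruled out" from negating "infection".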

# NegEx algorithm implementation

Implementations:
- https://github.com/PlanTL-GOB-ES/NegEx-MES --> for Spanish datasets: https://github.com/PlanTL-GOB-ES/NegEx-MES/tree/master/smn/config_files/spa
- https://github.com/chapmanbe/negex/tree/master/negex.python --> code implementation

In [None]:
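def negations_and_scopes(text,neg,umls):
    """
    Detect negation cues and their scopes in a text from the corpus
    """
    # TODO: not implemented yet; return empty (cues, scopes) lists so callers can unpack
    return [], []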

In [None]:
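def uncertainity_and_scopes(text,unc,umls):
    """
    Detect uncertainty cues and their scopes in a text from the corpus
    """
    # TODO: not implemented yet; return empty (cues, scopes) lists so callers can unpack
    return [], []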

In [None]:
pred_neg, pred_neg_scopes = [],[]
pred_unc, pred_unc_scopes = [],[]

for text in corpus:
    pred_neg_, pred_neg_scopes_ = negations_and_scopes(text,neg,umls)
    pred_unc_, pred_unc_scopes_ = uncertainity_and_scopes(text,unc,umls)
    pred_neg.extend(pred_neg_)
    pred_neg_scopes.extend(pred_neg_scopes_)
    pred_unc.extend(pred_unc_)
    pred_unc_scopes.extend(pred_unc_scopes_)

## Rule-based method Evaluation
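
One way the evaluation could be set up is exact-match precision, recall, and F1 over predicted spans. This is an assumption about the metric, not a method stated in the notebook, and `span_prf` is a hypothetical helper. Because `(start, end)` pairs from different texts can coincide, the sketch assumes each span also carries a document index.

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 between two collections of spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Spans are (document index, start, end) triples in this sketch
gold = [(0, 0, 2), (0, 5, 9)]
pred = [(0, 0, 2), (1, 3, 4)]
print(span_prf(gold, pred))
```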