**Analítica de datos en salud**

Presentado por:

* 2400452 - Jennifer Benavides Castillo
* 2400479 - Cristhian David Cruz Millán
* 2400794 - Sergio Alejandro Fierro Ospitia
* 2400478 - Edwin Andrés Lasso Rosero

#Entregable 2:
Aplicar un modelo NER preentrenado sobre historias clínicas para identificar y clasificar entidades médicas específicas relacionadas con el cáncer de pulmón.

##Sirve para:

* Cargar un modelo Transformer previamente entrenado en tareas clínicas (NER) desde Hugging Face.

* Tokenizar y procesar automáticamente textos clínicos libres (oraciones) para extraer entidades como medicamentos, tratamientos, fechas, conceptos oncológicos, etc.

* Generar salidas estructuradas con las entidades reconocidas y sus etiquetas correspondientes.

* Interpretar resultados en forma de tabla, útil para análisis posteriores o visualización.

##Se puede utilizar en:

* Automatización del análisis de historias clínicas de pacientes con cáncer de pulmón.

* Generación de bases de datos estructuradas a partir de textos clínicos no estructurados.

* Apoyo al diagnóstico, seguimiento y documentación clínica mediante herramientas basadas en IA.

In [None]:
### Validación: Uso del modelo de cáncer de pulmón para hacer predicciones de nuevas oraciones.

In [None]:
!pip install transformers[torch]
!pip install accelerate

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch<2.7,>=2.1->transformers[torch])
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch<2.7,>=2.1->transformers[torch])
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch<2.7,>=2.1->transformers[torch])
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch<2.7,>=2.1->transformers[torch])
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch<2.7,>=2.1->transformers[torch])
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch<2.7,>=2.1->transformers[torch])
 

In [None]:
pip install tqdm



In [None]:
from huggingface_hub import login
login("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")

In [None]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import torch.nn.functional as F
from tqdm import tqdm


In [None]:
### Diccionario con las etiquetas usadas en el modelo
id2label = {
    0: 'B-CANCER_CONCEPT',
    1: 'B-CHEMOTHERAPY',
    2: 'B-DATE',
    3: 'B-DRUG',
    4: 'B-FAMILY',
    5: 'B-FREQ',
    6: 'B-IMPLICIT_DATE',
    7: 'B-INTERVAL',
    8: 'B-METRIC',
    9: 'B-OCURRENCE_EVENT',
    10: 'B-QUANTITY',
    11: 'B-RADIOTHERAPY',
    12: 'B-SMOKER_STATUS',
    13: 'B-STAGE',
    14: 'B-SURGERY',
    15: 'B-TNM',
    16: 'I-CANCER_CONCEPT',
    17: 'I-DATE',
    18: 'I-DRUG',
    19: 'I-FAMILY',
    20: 'I-FREQ',
    21: 'I-IMPLICIT_DATE',
    22: 'I-INTERVAL',
    23: 'I-METRIC',
    24: 'I-OCURRENCE_EVENT',
    25: 'I-SMOKER_STATUS',
    26: 'I-STAGE',
    27: 'I-SURGERY',
    28: 'I-TNM',
    29: 'O'
}

#label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)


In [None]:
print(num_labels)

30


In [None]:
# Cargar modelo y tokenizer
# Se carga el modelo entrenado previamente
hugging_face_NER_model="jenniferbc/bert-base-uncased-finetuned-ner-lung"

model = AutoModelForTokenClassification.from_pretrained(hugging_face_NER_model,
        num_labels = num_labels,
        id2label = id2label,
        label2id = {v: k for k, v in id2label.items()}
)

tokenizer = AutoTokenizer.from_pretrained(hugging_face_NER_model, use_fast = True)


# Usar GPU si está disponible
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)



all_results = []
batch_size = 8


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

In [None]:
texts = [
    "Adenocarcinoma de pulmón derecho cT3 (contacta pleura) cN3 (supraclavicular derecha) cM0, estadio IIIB.",
    "En resumen, se trata de una paciente mujer de 70 años con antecedentes de desprendimiento de retina en OI, degeneración macular bilateral y trastorno depresivo, nunca fumadora.",
    "Acude con excelente estado general, asintomática.",
    "SERVICIO: Oncología Radioterápica FECHA INGRESO: 30/04/2017 19:27 FECHA ALTA: 8/5/2017.",
    "(IPA aproximado de 110 paquetes/año)",
    "No otras enfermedades médicas de interés. Intervenciones quirúrgicas: várices extremidad inferior izquierda (2005).",
    "PARACETAMOL 1 GR COMPRIMIDOS: 1 comprimido cada 8 horas si precisa por fiebre o dolor.",
    "HISTORIA ONCOLÓGICA: Diagnosticado en septiembre de 2016 de carcinoma de pulmón tipo mixto de lóbulo superior izdo (adenocarcinoma neuroendocrino de células gigantes) cT3N0M0.",
    "AP: Pieza de lobectomía (LSI) con carcinoma neuroendocrino de célula grande (60%) combinado con adenocarcinoma acinar (10%) y carcinoma pleomorfo con células gigantes (30%). Estadio patológico (TNM 7ª ed): pT3N2L1V1R0.",
    "Recibe tratamiento adyuvante con quimioterapia segun esquema Carboplatino: 4 ciclos entre el 12.12.2016 y 20.02.2017."
]
        #Ensayar con 10 oraciones

# Tokenización
encodings = tokenizer(
        texts,
        truncation=True,
        padding=True,
        return_offsets_mapping=True,
        return_attention_mask=True,
        return_token_type_ids=False,
        max_length=512,
        is_split_into_words=False
        )



In [None]:
input_ids = torch.tensor(encodings["input_ids"]).to(device)

attention_mask = torch.tensor(encodings["attention_mask"]).to(device)


with torch.no_grad():
 outputs = model(input_ids=input_ids, attention_mask=attention_mask)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
probs = F.softmax(logits, dim=-1)

In [None]:
print (predictions)

tensor([[29,  0,  0, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 15, 28, 29, 29, 29,
         29, 29, 29, 29, 15, 28, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 15, 28,
         29, 13, 26, 26, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
         29, 29,  0, 16,  0, 29, 16, 16, 16, 16, 16, 26, 29, 29, 29, 29, 29, 29,
         29, 29, 29, 29, 29, 15, 29, 29,  0, 29, 29, 29, 29,  0,  0,  0, 16, 29,
         29, 29, 16],
        [29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 10,  8,  8,
         29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
         29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 12, 29,
         29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
         29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
         29, 29, 29],
        [29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,
         29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 

In [None]:
#print (probs[0])

In [None]:
### Para cada oracion en la lista de oraciones.
for i, text in enumerate(texts):
  word_ids = encodings.word_ids(batch_index=i)
  tokens = tokenizer.convert_ids_to_tokens(encodings["input_ids"][i])
  print("\n \n=================================================================================================\n")
  print(word_ids)
  print(tokens)

  previous_word_id = None
  aligned_words, aligned_labels, aligned_scores = [], [], []

  for token, label_id, word_id in zip(tokens, predictions[i].tolist(), word_ids):
    #print(token, " ", label_id, " ", word_id)

    if word_id is None:
      continue

    if word_id != previous_word_id:
      aligned_words.append(token.replace("▁", ""))
      aligned_labels.append(id2label[label_id])
      aligned_scores.append(probs[i][word_id][label_id].item())
    else:
      aligned_words[-1] += token.replace("▁", "")
    previous_word_id = word_id

  filtered_results = [
            (word, label, score)
            for word, label, score in zip(aligned_words, aligned_labels, aligned_scores)
            if label != "O"
  ]

  ###Resultados
  print("\n ")
  print("Palabras: ", aligned_words)
  print("Labels: ", aligned_labels)
  print("Score: ", aligned_scores)
  print("\n ")



 

[None, 0, 0, 0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 9, 9, 10, 11, 11, 11, 11, 11, 12, 12, 12, 13, 14, 14, 15, 16, 17, 17, 18, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
['[CLS]', 'aden', '##oca', '##rc', '##ino', '##ma', 'de', 'pu', '##lm', '##on', 'der', '##ech', '##o', 'ct', '##3', '(', 'contact', '##a', 'pl', '##eur', '##a', ')', 'cn', '##3', '(', 'su', '##pr', '##ac', '##lav', '##icular', 'der', '##ech', '##a', ')', 'cm', '##0', ',', 'estadio', 'iii', '##b', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[P

In [None]:
import pandas as pd

# Para cada oración en texts
for i, text in enumerate(texts):
    word_ids = encodings.word_ids(batch_index=i)
    tokens = tokenizer.convert_ids_to_tokens(encodings["input_ids"][i])

    previous_word_id = None
    aligned_words, aligned_labels, aligned_scores = [], [], []

    for token, label_id, word_id in zip(tokens, predictions[i].tolist(), word_ids):
        if word_id is None:
            continue

        if word_id != previous_word_id:
            aligned_words.append(token.replace("▁", ""))
            aligned_labels.append(id2label[label_id])
            aligned_scores.append(probs[i][word_id][label_id].item())
        else:
            aligned_words[-1] += token.replace("▁", "")
        previous_word_id = word_id

    # Filtrar etiquetas distintas de "O"
    filtered_results = [
        (word, label, score)
        for word, label, score in zip(aligned_words, aligned_labels, aligned_scores)
        if label != "O"
    ]

    # Crear DataFrame
    df_resultados = pd.DataFrame(filtered_results, columns=["Palabra", "Etiqueta", "Score"])

    print(f"\n\n======= Resultados para oración {i+1} =======")
    print("Texto original:", text)
    display(df_resultados)  # En Jupyter o Colab, usa display() para mejor formato




Texto original: Adenocarcinoma de pulmón derecho cT3 (contacta pleura) cN3 (supraclavicular derecha) cM0, estadio IIIB.


Unnamed: 0,Palabra,Etiqueta,Score
0,aden##oca##rc##ino##ma,B-CANCER_CONCEPT,0.0001066309
1,de,I-CANCER_CONCEPT,1.451446e-05
2,pu##lm##on,I-CANCER_CONCEPT,0.01007999
3,der##ech##o,I-CANCER_CONCEPT,0.9662098
4,ct##3,B-TNM,0.0001848784
5,cn##3,B-TNM,5.125432e-06
6,cm##0,B-TNM,0.02397145
7,estadio,B-STAGE,1.427807e-06
8,iii##b,I-STAGE,7.316231e-07




Texto original: En resumen, se trata de una paciente mujer de 70 años con antecedentes de desprendimiento de retina en OI, degeneración macular bilateral y trastorno depresivo, nunca fumadora.


Unnamed: 0,Palabra,Etiqueta,Score
0,70,B-QUANTITY,1.641972e-06
1,an##os,B-METRIC,6.197315e-06
2,fu##mad##ora,B-SMOKER_STATUS,7.64116e-07




Texto original: Acude con excelente estado general, asintomática.


Unnamed: 0,Palabra,Etiqueta,Score




Texto original: SERVICIO: Oncología Radioterápica FECHA INGRESO: 30/04/2017 19:27 FECHA ALTA: 8/5/2017.


Unnamed: 0,Palabra,Etiqueta,Score
0,fe##cha,B-OCURRENCE_EVENT,3e-06
1,ing##res##o,I-OCURRENCE_EVENT,6e-06
2,30,B-DATE,1e-06
3,/,I-DATE,2e-06
4,04,B-DATE,1e-06
5,/,I-DATE,2e-06
6,2017,I-DATE,3e-06
7,27,B-QUANTITY,0.000258
8,fe##cha,B-OCURRENCE_EVENT,5.2e-05
9,alta,I-OCURRENCE_EVENT,0.998778




Texto original: (IPA aproximado de 110 paquetes/año)


Unnamed: 0,Palabra,Etiqueta,Score
0,ipa,B-METRIC,2e-06
1,110,B-QUANTITY,3e-06
2,pa##quet##es,B-METRIC,8.3e-05
3,an##o,B-FREQ,3e-06




Texto original: No otras enfermedades médicas de interés. Intervenciones quirúrgicas: várices extremidad inferior izquierda (2005).


Unnamed: 0,Palabra,Etiqueta,Score
0,2005,B-DATE,5.781336e-07




Texto original: PARACETAMOL 1 GR COMPRIMIDOS: 1 comprimido cada 8 horas si precisa por fiebre o dolor.


Unnamed: 0,Palabra,Etiqueta,Score
0,para##ce##tam##ol,B-DRUG,0.000466
1,1,B-QUANTITY,4.4e-05
2,gr,B-METRIC,0.000396
3,1,B-QUANTITY,0.999695
4,com##pr##imi##do,B-METRIC,0.999509
5,cad##a,B-FREQ,0.000279
6,8,I-FREQ,0.000659
7,ho##ras,I-FREQ,0.106576




Texto original: HISTORIA ONCOLÓGICA: Diagnosticado en septiembre de 2016 de carcinoma de pulmón tipo mixto de lóbulo superior izdo (adenocarcinoma neuroendocrino de células gigantes) cT3N0M0.


Unnamed: 0,Palabra,Etiqueta,Score
0,diagnostic##ado,B-OCURRENCE_EVENT,8e-06
1,en,I-OCURRENCE_EVENT,6.1e-05
2,sept##ie##mbre,B-DATE,2e-06
3,de,I-DATE,3e-06
4,2016,I-DATE,1.5e-05
5,car##cino##ma,B-CANCER_CONCEPT,0.000109
6,de,I-CANCER_CONCEPT,5e-06
7,pu##lm##on,I-CANCER_CONCEPT,5e-06
8,aden##oca##rc##ino##ma,B-CANCER_CONCEPT,1.7e-05
9,ne##uro##end##oc##rino,I-CANCER_CONCEPT,0.999081




Texto original: AP: Pieza de lobectomía (LSI) con carcinoma neuroendocrino de célula grande (60%) combinado con adenocarcinoma acinar (10%) y carcinoma pleomorfo con células gigantes (30%). Estadio patológico (TNM 7ª ed): pT3N2L1V1R0.


Unnamed: 0,Palabra,Etiqueta,Score
0,lobe##ct##omi##a,B-SURGERY,2.2e-05
1,car##cino##ma,B-CANCER_CONCEPT,0.000228
2,ne##uro##end##oc##rino,I-CANCER_CONCEPT,0.000299
3,de,I-CANCER_CONCEPT,8.3e-05
4,ce##lu##la,I-CANCER_CONCEPT,0.000166
5,grande,I-CANCER_CONCEPT,4.3e-05
6,%,B-METRIC,1.5e-05
7,aden##oca##rc##ino##ma,B-CANCER_CONCEPT,0.000224
8,ac##ina##r,I-CANCER_CONCEPT,0.997657
9,10,B-QUANTITY,5e-06




Texto original: Recibe tratamiento adyuvante con quimioterapia segun esquema Carboplatino: 4 ciclos entre el 12.12.2016 y 20.02.2017.


Unnamed: 0,Palabra,Etiqueta,Score
0,qui##mi##ote##ra##pia,B-CHEMOTHERAPY,2.6e-05
1,car##bo##pl##atin##o,B-DRUG,1e-05
2,4,B-QUANTITY,4e-06
3,ci##cl##os,B-METRIC,2e-06
4,12,B-DATE,0.000159
5,.,I-DATE,0.001658
6,12,I-DATE,5e-06
7,.,I-DATE,1e-06
8,2016,I-DATE,1e-06
9,20,B-DATE,0.000443
